An Efficient Evolutionary Algorithm for Solving Incrementally Structured Problems


Jason Ansel, Maciej Pacula, Saman Amarasinghe, Una-May O'Reilly
MIT CSAIL
Jason Ansel (MIT) PetaBricks July 14, 2011 1 / 30

Who are we?
I do research in programming languages (PL) and compilers. The PetaBricks language is a collaboration between:
- a PL / compiler research group
- an evolutionary algorithms research group
- an applied mathematics research group
Our goal is to make programs run faster. We use evolutionary algorithms to search for faster programs. The PetaBricks language defines search spaces of algorithmic choices.

A motivating example
How would you write a fast sorting algorithm?
Insertion sort, quick sort, merge sort, radix sort...
Binary tree sort, Bitonic sort, Bubble sort, Bucket sort, Burstsort, Cocktail sort, Comb sort, Counting sort, Distribution sort, Flashsort, Heapsort, Introsort, Library sort, Odd-even sort, Postman sort, Samplesort, Selection sort, Shell sort, Stooge sort, Strand sort, Timsort?
Poly-algorithms

std::stable_sort, /usr/include/c++/4.5.2/bits/stl_algo.h, lines 3350-3367

Is 15 the right number?
The best cutoff (CO) changes. It depends on competing costs:
- cost of computation (< operator, call overhead, etc.)
- cost of communication (swaps)
- cache behavior (misses, prefetcher, locality)
Sorting 100000 doubles with std::stable_sort:
- CO 200 optimal on a Phenom 905e (15% speedup over CO = 15)
- CO 400 optimal on an Opteron 6168 (15% speedup over CO = 15)
- CO 500 optimal on a Xeon E5320 (34% speedup over CO = 15)
- CO 700 optimal on a Xeon X5460 (25% speedup over CO = 15)
If the best cutoff has changed, perhaps the best algorithm has also changed.
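The cutoff being discussed can be sketched as follows: a merge sort that switches to insertion sort once the subarray drops below a threshold. This is an illustrative Python version of the idea, not the libstdc++ implementation; the names and structure are my own.

```python
# Sketch of the cutoff idea behind std::stable_sort: recurse with merge
# sort until a subarray is smaller than CUTOFF, then finish it with
# insertion sort. Illustrative only; not the libstdc++ code.
CUTOFF = 15  # the hard-coded threshold the talk questions

def insertion_sort(a, lo, hi):
    # Sort a[lo:hi] in place.
    for i in range(lo + 1, hi):
        x, j = a[i], i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def merge(left, right):
    # Stable merge of two sorted lists.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    return out + left[i:] + right[j:]

def hybrid_sort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    if hi - lo < CUTOFF:            # the tunable algorithmic choice
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_sort(a, lo, mid)
    hybrid_sort(a, mid, hi)
    a[lo:hi] = merge(a[lo:mid], a[mid:hi])
```

Changing `CUTOFF` shifts work between the recursive merge phase and the quadratic-but-cache-friendly base case, which is exactly the machine-dependent trade-off the measurements above expose.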

Algorithmic choices
Language:
either { InsertionSort(out, in); }
or { QuickSort(out, in); }
or { MergeSort(out, in); }
or { RadixSort(out, in); }
Representation: a decision tree synthesized by our evolutionary algorithm.

Decision trees
Optimized for a Xeon E7340 (8 cores): insertion sort if N < 600, quick sort if N < 1420, otherwise 2-way merge sort.
Text notation (will be used later): I 600 Q 1420 M 2

Decision trees
Optimized for a Sun Fire T200 Niagara (8 cores): 16-way merge sort if N < 75, 8-way if N < 1461, 4-way if N < 2400, otherwise 2-way.
Text notation: M 16 75 M 8 1461 M 4 2400 M 2
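The text notation encodes a chain of input-size cutoffs, each guarding an algorithm choice. A minimal sketch of how such a tree could be evaluated (the `(label, cutoff)` pair representation and the `choose` helper are my own; PetaBricks stores these trees in its tuned configuration, not as strings):

```python
# Sketch: evaluate a decision tree like "I 600 Q 1420 M 2", read as
# "insertion sort if N < 600, quick sort if N < 1420, else 2-way merge
# sort". The final entry has no cutoff, so it always matches.
def choose(tree, n):
    """tree: list of (algorithm_label, cutoff) pairs; last cutoff is None."""
    for label, cutoff in tree:
        if cutoff is None or n < cutoff:
            return label

# The two tuned trees from the slides, transcribed into this representation:
xeon_e7340 = [("insertion sort", 600),
              ("quick sort", 1420),
              ("2-way merge sort", None)]
niagara = [("16-way merge sort", 75),
           ("8-way merge sort", 1461),
           ("4-way merge sort", 2400),
           ("2-way merge sort", None)]
```

For example, `choose(xeon_e7340, 100)` selects insertion sort, while `choose(niagara, 2000)` selects the 4-way merge sort.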

The configuration encoded by the genome:
- decision trees
- algorithm parameters (integers, floats)
- parallel scheduling / blocking parameters (integers)
- synthesized scalar functions (not used in the benchmarks shown)
The average PetaBricks benchmark's genome has 1.9 decision trees, 10.1 algorithm/parallelism/blocking parameters, 0.6 synthesized scalar functions, and 2^3107 possible configurations.

Outline: 1. PetaBricks Language, 2. Autotuning Problem, 3. INCREA, 4. Evaluation, 5. Conclusions

PetaBricks programs at runtime
Request -> Program -> Response
A configuration (a point in a ~100-dimensional space) goes into the program; measurements of performance and accuracy (QoS) come out, driving offline autotuning.

The challenges
Evaluating the objective function is expensive:
- must run the program (at least once)
- more expensive for unfit solutions
- scales poorly with larger problem sizes
Fitness is noisy:
- randomness from parallel races and system noise
- testing each candidate only once often produces a worse algorithm
- running many trials is expensive
Decision tree structures are complex:
- theoretically infinite size
- we artificially bound them to 736 bits (23 ints) each

Contrasting two evolutionary approaches
GPEA: General Purpose Evolutionary Algorithm, used as a baseline.
INCREA: Incremental Evolutionary Algorithm, with a bottom-up approach, a noisy fitness evaluation strategy, and domain-informed mutation operators.

General purpose evolutionary algorithm (GPEA)
Initial population: 72.7s, 10.5s, 4.1s, 31.2s (cost = 118.5)
Generation 2: 4.2s, 5.1s, 2.6s, 13.2s (cost = 25.1)
Generation 3: 2.8s, 0.1s, 3.8s, 2.3s (cost = 9.0)
Generation 4: 0.3s, 0.1s, 0.4s, 2.4s (cost = 3.2)
The cost of autotuning is front-loaded in the initial (unfit) population. We could speed up tuning if we started with a faster initial population.
Key insight: smaller input sizes can be used to form a better initial population.

Bottom-up evolutionary algorithm
Train on input size 1, to form the initial population for:
train on input size 2, to form the initial population for:
train on input size 8, to form the initial population for:
train on input size 16, to form the initial population for:
train on input size 32, to form the initial population for:
train on input size 64.
This naturally exploits the optimal substructure of problems.
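The bottom-up structure can be sketched as a loop over doubling input sizes, where each round seeds its population with the winners from the previous size. Everything here is a simplified illustration under my own assumptions (a doubling schedule, a toy mutate/select inner loop, and a single `cutoff` gene), not the actual INCREA algorithm.

```python
# Sketch of bottom-up training: tune at growing input sizes, carrying the
# surviving population forward as the seed for the next, larger size.
import random

def mutate(config):
    # Toy mutation over a single integer gene; illustrative only.
    return {k: max(1, v + random.randint(-2, 2)) for k, v in config.items()}

def tune_one_size(population, n, fitness, generations=3):
    """One round of mutate / evaluate-at-size-n / select (lower is fitter)."""
    pop = list(population)
    for _ in range(generations):
        pop += [mutate(c) for c in pop]        # grow
        pop.sort(key=lambda c: fitness(c, n))  # evaluate at input size n
        pop = pop[: len(pop) // 2]             # select the fitter half
    return pop

def incremental_tune(fitness, max_size=64, pop_size=4):
    population = [{"cutoff": random.randint(1, 8)} for _ in range(pop_size)]
    n = 1
    while n <= max_size:                       # sizes 1, 2, 4, ..., 64
        population = tune_one_size(population, n, fitness)
        n *= 2
    return population[0]                       # best configuration found
```

Because early rounds run on tiny inputs, even unfit candidates are cheap to evaluate, which is exactly how this scheme avoids the front-loaded cost seen in GPEA.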

Noisy fitness evaluation
Both strategies terminate slow tests early. GPEA uses 1 trial per candidate algorithm. INCREA adaptively changes the number of trials:
- represents fitness as a probability distribution
- runs a single-tailed t-test to get confidence in differences
- runs more trials if confidence is low
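The adaptive-trials idea can be sketched as follows: treat each candidate's timings as noisy samples and add trials until a t-statistic separates the two candidates. The threshold of 2.0 (roughly the two-sided 95% critical value), the trial cap, and the Welch-style statistic are my assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: keep running trials of two candidates until a t-test gives
# confidence that one is faster, or a trial budget is exhausted.
import statistics

def t_statistic(xs, ys):
    """Welch-style t-statistic for two independent samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    # Tiny epsilon guards against zero variance in degenerate samples.
    return (mx - my) / ((vx / len(xs) + vy / len(ys) + 1e-12) ** 0.5)

def compare(run_a, run_b, threshold=2.0, min_trials=2, max_trials=10):
    """Return 'a', 'b', or 'tie'; run_* are callables returning one timing."""
    a = [run_a() for _ in range(min_trials)]
    b = [run_b() for _ in range(min_trials)]
    while len(a) < max_trials:
        t = t_statistic(a, b)
        if abs(t) >= threshold:           # confident difference found
            return "a" if t < 0 else "b"  # lower mean time wins
        a.append(run_a())                 # confidence low: run more trials
        b.append(run_b())
    return "tie"
```

Candidates that are clearly different separate after the minimum number of trials, so the expensive extra runs are spent only on close races, where a single noisy trial would most often pick the wrong winner.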

Domain-informed mutation operators
Mutation operators deal with larger structures in the genome:
- add algorithm Y to the top of decision tree X
- scale cutoff X using a lognormal distribution
Generated fully automatically by our compiler.
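The lognormal scaling mentioned above can be sketched in one line: multiplying a cutoff by exp(N(0, sigma)) makes small relative changes common and large ones rare, while keeping the value positive. The sigma of 0.5 is my assumption for illustration; the compiler-generated operators choose their own scales.

```python
# Sketch of the "scale cutoff by a lognormal distribution" mutation:
# multiplicative noise explores cutoffs on a relative (roughly
# order-of-magnitude) scale rather than an additive one.
import random

def mutate_cutoff(cutoff, sigma=0.5):
    # lognormvariate(0, sigma) has median 1, so the cutoff is equally
    # likely to shrink or grow; max(1, ...) keeps it a valid size.
    return max(1, round(cutoff * random.lognormvariate(0, sigma)))
```

This matters for cutoffs because the useful values in the tables below span several orders of magnitude (from tens to hundreds of thousands); additive steps of fixed size could not search that range efficiently.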

Outline: 1. PetaBricks Language, 2. Autotuning Problem, 3. INCREA, 4. Evaluation, 5. Conclusions

Experimental setup
Measuring convergence time: important to both program users and developers, and vital in online autotuning.
Three fixed-accuracy PetaBricks programs:
- Sort, 2^20 elements (small input size)
- Sort, 2^23 elements (large input size)
- Matrix multiply
- Eigenvector solve
Representative runs: average of 30 runs, with tests for statistical significance in the paper. Run on an 8-core Xeon running Debian 5.0.

Sort 2^20: training input size. [Plot: training input size vs. training time, INCREA vs. GPEA]

Sort 2^20: candidates tested. [Plot: tests conducted vs. training time, INCREA vs. GPEA]

Sort 2^20: performance. [Plot: best candidate time (s) vs. training time, INCREA vs. GPEA]

Sort 2^23: performance. [Plot: best candidate time (s) vs. training time, INCREA vs. GPEA]

INCREA: Sort, best algorithm at each generation

Input size  Training time (s)  Genome
2^0         6.9                Q 64 Q^p
2^1         14.6               Q 64 Q^p
2^2         26.6               I
...
2^7         115.7              I
2^8         138.6              I 270 R 1310 R^p
2^9         160.4              I 270 Q 1310 Q^p
2^10        190.1              I 270 Q 1310 Q^p
2^11        216.4              I 270 Q 3343 Q^p
2^12        250.0              I 189 R 13190 R^p
2^13        275.5              I 189 R 13190 R^p
2^14        307.6              I 189 R 17131 R^p
2^15        341.9              I 189 R 49718 R^p
2^16        409.3              I 189 R 124155 M 2
2^17        523.4              I 189 Q 5585 Q^p
2^18        642.9              I 189 Q 5585 Q^p
2^19        899.8              I 456 Q 5585 Q^p
2^20        1313.8             I 456 Q 5585 Q^p

I = insertion-sort, Q = quick-sort, R = radix-sort, M x = x-way merge-sort; ^p indicates run in parallel.

GPEA: Sort, best algorithm at each generation

Generation  Training time (s)  Genome
0           91.4               I 448 R
1           133.2              I 413 R
2           156.5              I 448 R
3           174.8              I 448 Q
4           192.0              I 448 Q
5           206.8              I 448 Q
6           222.9              I 448 Q 4096 Q^p
7           238.3              I 448 Q 4096 Q^p
8           253.0              I 448 Q 4096 Q^p
9           266.9              I 448 Q 4096 Q^p
10          281.1              I 371 Q 4096 Q^p
11          296.3              I 272 Q 4096 Q^p
12          310.8              I 272 Q 4096 Q^p
...
27          530.2              I 272 Q 4096 Q^p
28          545.6              I 272 Q 4096 Q^p
29          559.5              I 370 Q 8192 Q^p
30          574.3              I 370 Q 8192 Q^p
...

I = insertion-sort, Q = quick-sort, R = radix-sort, M x = x-way merge-sort; ^p indicates run in parallel.

Matrix Multiply (input size 1024x1024). [Plot: best candidate time (s) vs. training time, INCREA vs. GPEA]

Eigenvector Solve (input size 1024x1024). [Plot: best candidate time (s) vs. training time, INCREA vs. GPEA]

Outline: 1. PetaBricks Language, 2. Autotuning Problem, 3. INCREA, 4. Evaluation, 5. Conclusions

Conclusions
Take away: the technique of solving incrementally structured problems by exploiting knowledge from smaller problem instances may be more broadly applicable.
Take away: PetaBricks is a useful framework for comparing techniques for autotuning programs.

Thanks! Questions? http://projects.csail.mit.edu/petabricks/