Automatic Loop Interchange

RETROSPECTIVE: Automatic Loop Interchange

Randy Allen
Catalytic Compilers
1900 Embarcadero Rd, #206
Palo Alto, CA

Ken Kennedy
Department of Computer Science
Rice University
Houston, TX

Retrospectives provide a rare and interesting opportunity to reflect upon the past and recall (or, more accurately after a couple of decades, speculate upon) our state of mind and understandings at earlier times. "Automatic Loop Interchange" was published almost 20 years ago at a midpoint in research on data dependence and program transformations. This paper pulled together the work of many predecessors [9,10,11,15] into a simple, clean theory, providing a checkpoint on earlier predictions on the power and applicability of data dependence. At the same time, the paper was published just as the field of data dependence entered a golden age. It is our hope that this paper helped catalyze the following research into data dependence theory and applications.

A sabbatical at IBM catalyzed our initial interest in automatic vectorization. We started development with a version of the Parafrase system built at the University of Illinois by Dave Kuck and his group (including Michael Wolfe) [9,15], but we began work on an entirely new system in 1981 to provide a better platform for our research on multilevel code generation [8] (eventually published in 1987 [3]). The new system became known as the Parallel Fortran Converter (PFC).

At the time we started the PFC project, few programmers had access to vectorizing compilers that used data dependence. Vector units and vectorizing compilers were employed exclusively on expensive high-end machines or specialized array processors, which were available to only a small percentage of the general programming public. Despite this limited access, vectorizing compilers had already earned the informal nicknames "paralyzers" and "terrorizers" due to their large compile times and often less-than-optimal output. We began our effort with modest expectations.
PFC was deliberately structured as a source-to-source translator, primarily because we believed the algorithms that we wanted to employ would require more compile time than could be justified in a production compiler. We also doubted the power of data dependence, and expected that we would need to employ techniques from artificial intelligence to achieve satisfactory results. This paper marked a point in PFC's development where our early assumptions had been proved wrong. What PFC and this paper had shown was that a fairly simple set of program transformations based on a unified underlying theory could provide effective restructuring without requiring unacceptable compile times.

20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation ( ): A Selection, Copyright 2003 ACM $5.00

Foundation

Although this paper is entitled "Automatic Loop Interchange", it is far broader in scope. As an introduction to interchange, the paper also covers a wide spectrum of dependence-based theory and transformations. This work is built on the efforts of many others, and we would be remiss if we did not acknowledge at least some of those efforts; acknowledging all of them would quickly blow our page limits.

The earliest papers on dependence-based program transformations include papers by Lamport [10,11] and Kuck [9]. Lamport developed a form of loop interchange for use in vectorization, as well as the wavefront method for parallelization, an early form of what came to be called loop skewing. As indicated earlier, we also had access to the Parafrase system and the associated body of research. In particular, Michael Wolfe's Master's thesis focused on loop interchange [15], a topic that he developed further in later works [16,17].
Our own work on the subject began with our multilevel code generation strategy [8, 1, 2, 3], which we implemented in the summer of [...]. Real implementations often provide incredible insights into the weaknesses of theoretical approaches; this was definitely true in the case of PFC. The code generation strategy proved extremely effective in practice and was far more efficient in terms of compile time than we had anticipated.¹ However, a real implementation quickly showed us that loop interchange was the key missing piece. While PFC performed well in terms of the vectorization it detected, we quickly saw that loop interchange was the key incremental transformation.

The practical strategy presented in the paper ("innermosting" loops that carried no dependence, and testing loops that carried dependences for interchange only to the next deeper position) evolved out of discussions with Randy Scarborough, Joe Warren, and others in the PFC project. The strategy reported in this paper was implemented in the PFC system. Although we reported no experimental results in the paper, a later study reviewed in our book [4] showed that PFC was able to do extremely well on the Callahan, Dongarra, and Levine vectorization tests [7].

¹ At that time, we had to pay for computer time by the CPU-minute. The first time that we tried a large test case (roughly 1000 lines of code), Ken insisted that we limit the CPU time to 10 minutes (which was still several thousand dollars of computer time) to avoid blowing our research budget. We didn't expect the test case to complete in the time limit; when it took only 40 seconds, we assumed that PFC had crashed processing the input. It took us a day of wading through the output to verify that it had in fact completely and correctly processed the test.

ACM SIGPLAN 75 Best of PLDI
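
As an illustration only, the two questions behind the practical strategy above (which loops carry no dependence, and whether an interchange is legal) can be sketched with direction vectors, the formulation used in the later direction-matrix work. This is a minimal sketch, not the paper's algorithm or PFC's implementation. All function names are hypothetical, and the direction vectors are assumed to have been computed already by a dependence analyzer.

```python
from typing import List, Optional, Sequence, Tuple

# A direction vector has one entry per loop, outermost first (assumed given).
# "<" : the source iteration precedes the sink iteration in that loop
# "=" : same iteration        ">" : the source iteration follows the sink
DirVector = Tuple[str, ...]


def carrier_level(vec: DirVector) -> Optional[int]:
    """Index of the loop that carries the dependence, or None if loop-independent."""
    for level, direction in enumerate(vec):
        if direction != "=":
            return level
    return None


def is_preserved(vec: DirVector) -> bool:
    """A dependence still runs source-to-sink iff its leftmost non-"=" entry is "<"."""
    level = carrier_level(vec)
    return level is None or vec[level] == "<"


def swap_loops(vec: DirVector, a: int, b: int) -> DirVector:
    """Direction vector after interchanging the loops at positions a and b."""
    out = list(vec)
    out[a], out[b] = out[b], out[a]
    return tuple(out)


def interchange_is_legal(vectors: Sequence[DirVector], a: int, b: int) -> bool:
    """Interchange is legal iff every dependence is preserved afterwards."""
    return all(is_preserved(swap_loops(v, a, b)) for v in vectors)


def dependence_free_loops(vectors: Sequence[DirVector], depth: int) -> List[int]:
    """Loops that carry no dependence: candidates for the innermost position."""
    carriers = {carrier_level(v) for v in vectors}
    return [level for level in range(depth) if level not in carriers]


# A(I, J+1) = A(I, J): carried by the inner J loop only.
inner_carried = [("=", "<")]
# A(I+1, J-1) = A(I, J): distance (1, -1), so direction ("<", ">").
skewed = [("<", ">")]

print(interchange_is_legal(inner_carried, 0, 1))   # True
print(interchange_is_legal(skewed, 0, 1))          # False
print(dependence_free_loops(inner_carried, 2))     # [0]
```

In the first nest, loop 0 (the I loop) carries no dependence and the interchange is legal, so I can be moved to the innermost position, where it can run as a vector operation. In the second nest, the interchange would reverse the dependence and must be rejected.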

Impact

The approaches to dependence and loop interchange presented in this paper were soon incorporated into a number of commercial compilers. We are directly aware of the implementations in the IBM compiler for the 3090 Vector Feature [13] and the Convex vectorizing compiler, and were involved in the implementation of the Ardent restructuring compilers.

Beyond the immediate practical impact, "Automatic Loop Interchange" also established interchange as a fundamental transformation in all advanced optimizing compilers: vectorizing, parallelizing, and even scalar. While many previous papers had focused on dependences as execution constraints that limit reordering, this paper (in a section devoted to other applications of dependence) also pointed out the dual aspect: dependences represented reused memory locations. Accordingly, dependence provided a basis for optimizing for memory hierarchies by moving the most frequently accessed memory locations into the fastest elements of the hierarchy. Later research would prove loop interchange to be as important for moving dependences into inner loops (thereby optimizing memory reuse) as it had proven to be for moving dependences out of inner loops (as was necessary for vectorizing loops). Particularly important exemplars of this research are the papers by Callahan, Carr, and Kennedy on register optimization [6] and by Wolf and Lam on cache blocking [14]. Both papers are included in this volume. Practical implementations that included this aspect of dependence include the Ardent compiler [5].

Future Applications

Looking back over the past 18 years, we doubt that we would have predicted the impact of loop interchange on the compiler literature.
Although our own work and the work of others went on to more powerful transformation strategies based on direction and distance matrices [4, 14, 16, 17], this work was one of the first to establish that powerful and effective program transformations could be implemented in practical compiler systems. Of course, one reason for the growth in importance of this work is the increased use of parallelism in computer architecture and the increasing disparity between CPU and memory speeds. Looking to the future, we believe these factors are only going to increase in the design of computer systems, making these compiler techniques even more relevant. Memory hierarchies in particular are increasingly dominating computation times, and automatic loop interchange is a key transformation for exploiting that hierarchy.

While loop interchange has been thoroughly explored in the context of restructuring compilers, there are other contexts which have not been so thoroughly explored. For instance, given the intimate relationship between dependence and loop iterations, it is natural to assume that dependence and loop interchange should have as important a role to play in the design of pipelined architectures as they do in exploiting pipelined architectures.

Bibliography

1. J. R. Allen. Dependence analysis for subscripted variables and its application to program transformations. Ph.D. dissertation, Department of Mathematical Sciences, Rice University, May [...].
2. J. R. Allen and K. Kennedy. PFC: a program to convert Fortran to parallel form. In Supercomputers: Design and Applications, K. Hwang, editor, pages [...]. IEEE Computer Society Press, August [...].
3. J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):[...], October [...].
4. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, [...].
5. R. Allen. Unifying vectorization, parallelization, and optimization: the Ardent compiler. In Proceedings of the Third International Conference on Supercomputing, [...].
6. D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In PLDI '90 (also included in this volume).
7. D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: A test suite and results. In Proceedings of Supercomputing '88, Orlando, FL, [...].
8. K. Kennedy. Automatic translation of Fortran programs to vector form. Rice Technical Report [...], Department of Mathematical Sciences, Rice University, [...].
9. D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January [...].
10. L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83-93, February [...].
11. L. Lamport. The coordinate method for the parallel execution of iterative DO loops. Technical Report CA [...], SRI, Menlo Park, CA, August 1976, revised October [...].
12. D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):[...], December [...].
13. R. G. Scarborough and H. G. Kolsky. A vectorizing FORTRAN compiler. IBM Journal of Research and Development, March [...].
14. M. E. Wolf and M. Lam. A data locality optimizing algorithm. In PLDI '91 (also included in this volume).
15. M. J. Wolfe. Techniques for improving the inherent parallelism in programs. Master's thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, July [...].
16. M. J. Wolfe. Advanced loop interchanging. In Proceedings of the 1986 International Conference on Parallel Processing, St. Charles, IL, August [...].
17. M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, [...].

Acknowledgements

As was the case at the time the paper was published, this work has progressed over the years only by the efforts and collaborations of others far too numerous to list here. However, we would be remiss if we did not acknowledge the contributions of Randy Scarborough, Joe Warren, Horace Flatt, and all the graduate students who worked on PFC.

