Publications

Using the Loop Chain Abstraction to Schedule across Loops in Existing Code

Published in the International Journal of High Performance Computing and Networking, 2018

Abstract

Exposing opportunities for parallelisation while explicitly managing data locality is the primary challenge to porting and optimising computational science simulation codes to improve performance. OpenMP provides mechanisms for expressing parallelism, but it remains the programmer’s responsibility to group computations to improve data locality. The loop chain abstraction, where a summary of data access patterns is included as pragmas associated with parallel loops, provides compilers with sufficient information to automate the parallelism versus data locality trade-off. We present the syntax and semantics of loop chain pragmas for indicating information about loops belonging to the loop chain and for specifying a high-level schedule for the loop chain. We show example usage of the pragmas, detail attempts to automate the transformation of a legacy scientific code, written under specific language constraints, into loop chain code, describe the compiler implementation of the loop chain pragmas, and exhibit performance results for a computational fluid dynamics benchmark.
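
For readers unfamiliar with the trade-off the abstract describes, here is a minimal C/OpenMP sketch (our illustration, not code from the paper): each loop below parallelises cleanly on its own, but the B values produced by the first sweep have long left cache by the time the second sweep re-reads them. Loop chain pragmas summarise exactly this kind of producer-consumer access pattern so a compiler can fuse and tile across the loops instead of the programmer restructuring the code by hand.

```c
/* Two adjacent parallel sweeps over shared data. Illustrative only. */
void two_sweeps(int n, const double *A, double *B, double *C) {
    /* Sweep 1: smooth A into B. */
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        B[i] = (A[i - 1] + A[i] + A[i + 1]) / 3.0;

    /* Sweep 2: smooth B into C; re-reads every B[i] written above,
     * after it has already been evicted from cache for large n. */
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        C[i] = (B[i - 1] + B[i] + B[i + 1]) / 3.0;
}
```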

Recommended citation: Ian Bertolacci, Michelle Strout, Jordan Riley, Stephen Guzik, Eddie Davis, Catherine Olschanowsky, "Using the Loop Chain Abstraction to Schedule across Loops in Existing Code." International Journal of High Performance Computing and Networking, 2018. https://www.inderscienceonline.com/doi/abs/10.1504/IJHPCN.2019.097053

Extending OpenMP to Facilitate Loop Optimization

Published in the proceedings of Evolving OpenMP for Evolving Architectures (IWOMP), 2018

Abstract

OpenMP provides several mechanisms to specify parallel source-code transformations. Unfortunately, many compilers perform these transformations early in the translation process, often before performing traditional sequential optimizations, which can limit the effectiveness of those optimizations. Further, OpenMP semantics preclude performing those transformations in some cases prior to the parallel transformations, which can limit overall application performance.
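
Directive-specified loop transformations of the kind argued for here later took concrete form in the OpenMP 5.1 tile and unroll constructs. The sketch below uses the standardised tile construct to illustrate the idea; the paper's own proposed syntax may differ.

```c
/* A loop transformation expressed as a directive rather than by hand.
 * The tile construct shown is the OpenMP 5.1 form, not necessarily
 * the extension proposed in the paper. */
void scale(int n, int m, double *a) {
    #pragma omp tile sizes(32, 32)   /* blocks the 2-D nest into 32x32 tiles */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i * m + j] *= 2.0;
}
```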

Recommended citation: Ian Bertolacci, Michelle Strout, Bronis de Supinski, Thomas Scogland, Eddie Davis, Catherine Olschanowsky, "Extending OpenMP to Facilitate Loop Optimization." In the proceedings of Evolving OpenMP for Evolving Architectures (IWOMP), 2018. http://link.springer.com/10.1007/978-3-319-98521-3_4

Identifying and Scheduling Loop Chains Using Directives

Published in the proceedings of the Third Workshop on Accelerator Programming Using Directives (WACCPD), 2016

Abstract

Exposing opportunities for parallelization while explicitly managing data locality is the primary challenge to porting and optimizing existing computational science simulation codes to improve performance and accuracy. OpenMP provides many mechanisms for expressing parallelism, but it primarily remains the programmer’s responsibility to group computations to improve data locality. The loop chain abstraction, where data access patterns are included with the specification of parallel loops, provides compilers with sufficient information to automate the parallelism versus data locality tradeoff. In this paper, we present a loop chain pragma and an extension to the omp for construct to enable the specification of loop chains and high-level specifications of schedules on loop chains. We show example usage of the extensions, describe their implementation, and show preliminary performance results for some simple examples.
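
To make the extensions concrete, below is an approximate C sketch of the proposed directives; the spelling of the pragmas is paraphrased from the paper's examples and should be read as a hypothetical illustration, not exact standardised syntax. Each loop declares its iteration domain and a summary of its reads and writes, and the enclosing loopchain pragma names a high-level schedule, here fusion of the two loops.

```c
/* Approximate loop chain directives (hypothetical spelling). */
void chain(int N, const double *A, double *B, double *C) {
    #pragma omplc loopchain schedule(fuse())
    {
        #pragma omplc for domain(1:N-2) with (i) \
                read A {(i-1),(i),(i+1)} write B {(i)}
        for (int i = 1; i <= N - 2; ++i)
            B[i] = (A[i-1] + A[i] + A[i+1]) / 3.0;

        #pragma omplc for domain(1:N-2) with (i) \
                read B {(i-1),(i),(i+1)} write C {(i)}
        for (int i = 1; i <= N - 2; ++i)
            C[i] = (B[i-1] + B[i] + B[i+1]) / 3.0;
    }
}
```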

Recommended citation: Ian Bertolacci, Michelle Strout, Stephen Guzik, Jordan Riley, Catherine Olschanowsky, "Identifying and Scheduling Loop Chains Using Directives." In the proceedings of the Third Workshop on Accelerator Programming Using Directives (WACCPD), 2016. https://ieeexplore.ieee.org/abstract/document/7836581/

Parameterized Diamond Tiling for Stencil Computations with Chapel Parallel Iterators

Published in the Proceedings of the 29th ACM International Conference on Supercomputing (ICS), 2015

Abstract

Stencil computations figure prominently in the core kernels of many scientific computations, such as partial differential equation solvers. Parallel scaling of stencil computations can be significantly improved on multicore processors using advanced tiling techniques that include the time dimension, such as diamond tiling. Such techniques are difficult to include in general purpose optimizing compilers because of the need for inter-procedural pointer and array data-flow analysis, plus the need to tune scheduling strategies and tile size parameters for each pairing of stencil computation and machine. Since a fully automatic solution is problematic, we propose to provide parameterized space and time tiling iterators through libraries. Ideally, the execution schedule or tiling code will be expressed orthogonally to the computation. This supports code reuse, easier tuning, and improved programmer productivity. Chapel iterators provide this capability implicitly. We present an advanced, parameterized tiling approach that we have implemented using Chapel parallel iterators. We show how such iterators can be used by programmers in stencil computations with multiple spatial dimensions. We also demonstrate that these new iterators provide better scaling than a traditional data parallel schedule.
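
As a point of reference for the tiling the paper parameterises, the following C/OpenMP sketch (our illustration; the paper's implementation uses Chapel iterators) shows the baseline (t, i) loop nest of a 1-D Jacobi stencil. Diamond tiling reorders exactly this iteration space into parallelogram-shaped tiles executed in concurrent wavefronts, so values computed at one time step are still in cache when the next step consumes them.

```c
/* Baseline space-time stencil loop: serial over time, parallel over
 * space, streaming the whole array through cache every time step.
 * Diamond tiling reschedules these same (t, i) iterations. */
void jacobi1d(int T, int n, double *A, double *Anew) {
    for (int t = 0; t < T; ++t) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i)
            Anew[i] = (A[i - 1] + A[i] + A[i + 1]) / 3.0;
        double *tmp = A; A = Anew; Anew = tmp;  /* swap buffers */
    }
    /* Final values end up in the caller's A when T is even,
     * in the caller's Anew when T is odd. */
}
```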

Recommended citation: Ian Bertolacci, Catherine Olschanowsky, Ben Harshbarger, Bradford Chamberlain, David Wonnacott, Michelle Strout, "Parameterized Diamond Tiling for Stencil Computations with Chapel Parallel Iterators." In Proceedings of the 29th ACM International Conference on Supercomputing (ICS), 2015. http://doi.acm.org/10.1145/2751205.2751226