# Publications - Elmar Peise

### Journal Articles

**The ELAPS Framework: Experimental Linear Algebra Performance Studies**
International Journal of High Performance Computing Applications, pp. 1-13, March 2018.

```bibtex
@article{Peise2018:560,
  author    = "Elmar Peise and Paolo Bientinesi",
  title     = "The ELAPS Framework: Experimental Linear Algebra Performance Studies",
  journal   = "International Journal of High Performance Computing Applications",
  year      = 2018,
  pages     = "1--13",
  month     = mar,
  doi       = "10.1177/1094342018763042",
  publisher = "Sage",
  url       = "http://arxiv.org/pdf/1504.08035v1"
}
```

In scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.

**Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection**
ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 16:1-16:19, September 2017.

```bibtex
@article{Peise2017:728,
  author    = "Elmar Peise and Paolo Bientinesi",
  title     = "Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection",
  journal   = "ACM Transactions on Mathematical Software (TOMS)",
  year      = 2017,
  volume    = 44,
  number    = 2,
  pages     = "16:1--16:19",
  month     = sep,
  doi       = "10.1145/3061664",
  publisher = "ACM",
  address   = "New York, NY, USA",
  url       = "http://arxiv.org/pdf/1602.06763v1"
}
```

To exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimized libraries.

**High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods**
Computer Physics Communications, Volume 211, pp. 61-72, February 2017.

Special issue: High Performance Computing for Advanced Modeling and Simulation of Materials.

```bibtex
@article{Di_Napoli2017:318,
  author  = "Edoardo {Di Napoli} and Elmar Peise and Markus Hrywniak and Paolo Bientinesi",
  title   = "High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods",
  journal = "Computer Physics Communications",
  year    = 2017,
  volume  = 211,
  pages   = "61--72",
  month   = feb,
  note    = "High Performance Computing for Advanced Modeling and Simulation of Materials",
  doi     = "10.1016/j.cpc.2016.10.003",
  url     = "http://arxiv.org/pdf/1602.06589v2"
}
```

One of the greatest efforts of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low-level optimized kernels offering both performance and portability. Legacy codes suffer from the additional curse of rigid design choices based on outdated performance metrics (e.g., minimization of the memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees increased performance and an extended life span of the code, enabling larger-scale simulations.

**High Performance Solutions for Big-data GWAS**
Parallel Computing, Volume 42, pp. 75-87, February 2015.

Special issue on Parallelism in Bioinformatics.

```bibtex
@article{Peise2015:754,
  author  = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
  title   = "High Performance Solutions for Big-data GWAS",
  journal = "Parallel Computing",
  year    = 2015,
  volume  = 42,
  pages   = "75--87",
  month   = feb,
  note    = "Special issue on Parallelism in Bioinformatics",
  doi     = "10.1016/j.parco.2014.09.005",
  url     = "http://arxiv.org/pdf/1403.6426v1"
}
```

In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as clouds or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double-buffering technique, while the relationship matrix is kept across the main memory of a distributed-memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high performance and allow the analysis of enormous datasets.

**High-Performance Solvers for Dense Hermitian Eigenproblems**
SIAM Journal on Scientific Computing, Volume 35(1), pp. C1-C22, January 2013.

```bibtex
@article{Petschow2013:208,
  author    = "Matthias Petschow and Elmar Peise and Paolo Bientinesi",
  title     = "High-Performance Solvers for Dense Hermitian Eigenproblems",
  journal   = "SIAM Journal on Scientific Computing",
  year      = 2013,
  volume    = 35,
  number    = 1,
  pages     = "C1--C22",
  month     = jan,
  doi       = "10.1137/110848803",
  publisher = "Society for Industrial and Applied Mathematics",
  url       = "http://arxiv.org/pdf/1205.2107.pdf"
}
```

We introduce a new collection of solvers - subsequently called EleMRRR - for large-scale dense Hermitian eigenproblems. EleMRRR solves various types of problems: generalized, standard, and tridiagonal eigenproblems. Among these, the last is of particular importance, as it is a solver in its own right, as well as the computational kernel for the first two; we present a fast and scalable tridiagonal solver based on the Algorithm of Multiple Relatively Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers, PMRRR is part of the freely available Elemental library, and is designed to fully support both message-passing (MPI) and multithreading (SMP) parallelism. As a result, the solvers can equally be used in pure MPI or in hybrid MPI-SMP fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's solvers on two supercomputers. Such a study, performed with up to 8,192 cores, provides precise guidelines to assemble the fastest solver within the ScaLAPACK framework; it also indicates that EleMRRR outperforms even the fastest solvers built from ScaLAPACK's components.
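The solver hierarchy described above reduces a generalized eigenproblem to a standard one (and, internally, further to tridiagonal form). The following NumPy sketch illustrates the Cholesky-based reduction step on a single node; it is a small-scale illustration of the mathematical transformation, not EleMRRR or PMRRR themselves, and all variable names are for exposition only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                       # Hermitian (here: real symmetric) A
B = rng.standard_normal((n, n))
B = B @ B.T + n * np.eye(n)             # Hermitian positive-definite B

# Reduce A x = lambda B x to a standard problem via Cholesky B = L L^T:
# (L^{-1} A L^{-T}) y = lambda y, with x = L^{-T} y.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
C = Linv @ A @ Linv.T                   # standard Hermitian eigenproblem
lam, Y = np.linalg.eigh(C)              # LAPACK-backed dense solver
X = np.linalg.solve(L.T, Y)             # back-transform the eigenvectors

# Verify the generalized eigenpairs: A X = B X diag(lam).
assert np.allclose(A @ X, B @ X * lam)
```

In EleMRRR these steps are performed on distributed matrices, with the tridiagonal stage handled by PMRRR; the dense single-node solver here merely stands in for that pipeline.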

### Peer-Reviewed Conference Publications

**On the Performance Prediction of BLAS-based Tensor Contractions**
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.

```bibtex
@inproceedings{Peise2015:380,
  author    = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
  title     = "On the Performance Prediction of BLAS-based Tensor Contractions",
  booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
  year      = 2015,
  editor    = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
  volume    = 8966,
  series    = "Lecture Notes in Computer Science",
  pages     = "193--212",
  month     = apr,
  publisher = "Springer International Publishing",
  doi       = "10.1007/978-3-319-17248-4_10",
  url       = "http://arxiv.org/pdf/1409.8608v1"
}
```

Tensor operations are surging as the computational building blocks for a variety of scientific simulations, and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher-dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists of breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.

**A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels**
High Performance Computing for Computational Science -- VECPAR 2014, Lecture Notes in Computer Science, Volume 8969, pp. 245-258, Springer International Publishing, April 2015.

```bibtex
@inproceedings{Peise2015:250,
  author    = "Elmar Peise and Paolo Bientinesi",
  title     = "A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels",
  booktitle = "High Performance Computing for Computational Science -- VECPAR 2014",
  year      = 2015,
  editor    = "Daydé, Michel and Marques, Osni and Nakajima, Kengo",
  volume    = 8969,
  series    = "Lecture Notes in Computer Science",
  pages     = "245--258",
  month     = apr,
  publisher = "Springer International Publishing",
  doi       = "10.1007/978-3-319-17353-5_21",
  url       = "https://arxiv.org/pdf/1402.5897v1"
}
```

It is universally known that caching is critical to attain high-performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating-point operations. In this paper, we show evidence that, at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.

**Algorithms for Large-scale Whole Genome Association Analysis**
Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, pp. 229-234, ACM, 2013.

```bibtex
@inproceedings{Peise2013:504,
  author    = "Elmar Peise and Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi",
  title     = "Algorithms for Large-scale Whole Genome Association Analysis",
  booktitle = "Proceedings of the 20th European MPI Users' Group Meeting",
  year      = 2013,
  series    = "EuroMPI '13",
  pages     = "229--234",
  address   = "New York, NY, USA",
  publisher = "ACM",
  doi       = "10.1145/2488551.2488577",
  url       = "http://arxiv.org/pdf/1304.2272v1"
}
```

In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double-buffering technique, while the covariance matrix is kept across the main memory of a distributed-memory system. We show that these methods sustain high performance and allow the analysis of enormous datasets.

**Performance Modeling for Dense Linear Algebra**
Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12), SCC '12, pp. 406-416, IEEE Computer Society, November 2012.

```bibtex
@inproceedings{Peise2012:50,
  author    = "Elmar Peise and Paolo Bientinesi",
  title     = "Performance Modeling for Dense Linear Algebra",
  booktitle = "Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12)",
  year      = 2012,
  series    = "SCC '12",
  pages     = "406--416",
  address   = "Washington, DC, USA",
  month     = nov,
  publisher = "IEEE Computer Society",
  doi       = "10.1109/SC.Companion.2012.60",
  url       = "http://arxiv.org/pdf/1209.2364"
}
```

It is well known that the behavior of dense linear algebra algorithms is greatly influenced by factors like target architecture, underlying libraries and even problem size; because of this, the accurate prediction of their performance is a real challenge. In this article, we are not interested in creating accurate models for a given algorithm, but in correctly ranking a set of equivalent algorithms according to their performance. Aware of the hierarchical structure of dense linear algebra routines, we approach the problem by developing a framework for the automatic generation of statistical performance models for BLAS and LAPACK libraries. This allows us to obtain predictions through evaluating and combining such models. We demonstrate that our approach is successful in both single- and multi-core environments, not only in the ranking of algorithms but also in tuning their parameters.

**Branch and Bound for Optimal Jacobian Accumulation**
Proceedings of the Fifth SIAM Workshop on Combinatorial Scientific Computing (CSC11), pp. 73-76, 2011.

```bibtex
@inproceedings{Mosenkis2011:868,
  author    = "Viktor Mosenkis and Elmar Peise and Uwe Naumann",
  title     = "Branch and Bound for Optimal Jacobian Accumulation",
  booktitle = "Proceedings of the Fifth SIAM Workshop on Combinatorial Scientific Computing (CSC11)",
  year      = 2011,
  pages     = "73--76"
}
```
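Jacobian accumulation by the chain rule generalizes the matrix-chain ordering problem, so a branch-and-bound search over chain split points conveys the flavor of the approach. The sketch below is an illustrative stand-in under that simplification, not the algorithm of the paper: it finds the cheapest parenthesization of a matrix chain, pruning any partial solution that already exceeds the best complete cost found so far.

```python
def min_chain_cost(dims, i=0, j=None, bound=float("inf")):
    """Branch-and-bound for the cheapest parenthesization of a matrix chain.

    Matrix k has shape dims[k] x dims[k+1]; `bound` carries the incumbent
    (best complete cost so far), used to prune hopeless branches early.
    """
    if j is None:
        j = len(dims) - 2
    if i == j:                        # a single matrix: nothing to multiply
        return 0
    best = bound
    for k in range(i, j):             # branch on the top-level split position
        mult = dims[i] * dims[k + 1] * dims[j + 1]
        if mult >= best:
            continue                  # bound: this split alone is too expensive
        left = min_chain_cost(dims, i, k, best - mult)
        if mult + left >= best:
            continue                  # bound: left subchain ate the budget
        right = min_chain_cost(dims, k + 1, j, best - mult - left)
        if mult + left + right < best:
            best = mult + left + right
    return best

# ((A B) C) costs 10*20*30 + 10*30*40 = 18000, cheaper than (A (B C)) at 32000.
assert min_chain_cost([10, 20, 30, 40]) == 18000
```

Passing the remaining budget down as the subproblem's bound is what distinguishes this from plain exhaustive enumeration: subtrees that cannot improve on the incumbent are never fully explored.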

**Low-Memory Tour Reversal in Directed Graphs**
Combinatorial Scientific Computing, Dagstuhl Seminar Proceedings, Number 9061, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2009.

```bibtex
@inproceedings{Mosenkis2009:928,
  author    = "Viktor Mosenkis and Elmar Peise and Uwe Naumann",
  title     = "Low-Memory Tour Reversal in Directed Graphs",
  booktitle = "Combinatorial Scientific Computing",
  year      = 2009,
  number    = 9061,
  series    = "Dagstuhl Seminar Proceedings",
  address   = "Dagstuhl, Germany",
  publisher = "Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany"
}
```

We consider the problem of reversing a *tour* $(i_1, i_2, \ldots, i_l)$ in a directed graph $G = (V, E)$ with positive integer vertices $V$ and edges $E \subseteq V \times V$, where $i_j \in V$ and $(i_j, i_{j + 1}) \in E$ for all $j = 1, \ldots, l - 1$. The tour can be processed in last-in-first-out order as long as the size of the corresponding stack does not exceed the available memory. This constraint is violated in most cases when considering control-flow graphs of large-scale numerical simulation programs. The tour reversal problem also arises in adjoint programs used, for example, in the context of derivative-based nonlinear optimization, sensitivity analysis, or other, often inverse, problems. The intention is to compress the tour in order not to run out of memory. Since the general optimal compression problem has been proven to be NP-hard, and since large control-flow graphs result from loops in programs, we do not want to use general-purpose algorithms to compress the tour. Rather, we want to compress the tour by finding loops and replacing the redundant information with a proper representation of the loops.
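The loop-compression idea above can be sketched as follows: if a tour is stored in a compressed form as a prefix, a loop body repeated `count` times, and a suffix (names hypothetical, and real control flow is of course far less regular), the reversal can be emitted lazily, without ever materializing the full vertex sequence on a stack.

```python
def reverse_compressed(prefix, body, count, suffix):
    """Yield the vertices of the tour prefix + body*count + suffix in
    reverse order, keeping only the compressed form in memory."""
    yield from reversed(suffix)
    for _ in range(count):          # the loop body reverses count times
        yield from reversed(body)
    yield from reversed(prefix)

# The compressed form holds 6 vertices; the full tour holds 3003.
prefix, body, count, suffix = [1, 2], [3, 4, 5], 1000, [6]
full_tour = prefix + body * count + suffix
assert list(reverse_compressed(prefix, body, count, suffix)) == full_tour[::-1]
```

The memory footprint of the reversal is that of the compressed representation plus one generator frame, independent of the trip count of the loop.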

### Theses

**Performance Modeling and Prediction for Dense Linear Algebra**
Dissertation, RWTH Aachen University, 2017.

```bibtex
@phdthesis{Peise2017:620,
  author  = "Elmar Peise",
  title   = "Performance Modeling and Prediction for Dense Linear Algebra",
  school  = "RWTH Aachen University",
  year    = 2017,
  type    = "Dissertation",
  address = "Aachen",
  doi     = "10.18154/RWTH-2018-223017",
  url     = "https://publications.rwth-aachen.de/record/721186/files/721186.pdf"
}
```

This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow us to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.

**Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms**
Master's thesis, AICES, RWTH Aachen, May 2012.

```bibtex
@mastersthesis{Peise2012:168,
  author = "Elmar Peise",
  title  = "Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms",
  school = "AICES, RWTH Aachen",
  year   = 2012,
  month  = may,
  url    = "http://arxiv.org/pdf/1207.5217v3"
}
```

A large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on computing architecture, libraries and problem size, each variant attains different performance. We propose methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, we identify the routines upon which the algorithms are built. A first tool - the Sampler - measures the performance of these routines. Using the Sampler, a second tool models their performance. The generated models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow us to correctly rank the algorithms according to their performance without executing them. With the help of the same tools, algorithmic parameters such as the block size can be optimally tuned.
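The measure-model-predict-rank pipeline described in the two theses above can be caricatured in a few lines. Everything here is a hypothetical stand-in: the "measurements" come from a synthetic cubic cost function instead of the Sampler, the model is a plain polynomial fit, and the two "variants" are invented for illustration.

```python
import numpy as np

def fake_gemm_time(n):
    """Synthetic stand-in for measured kernel timings (not real data)."""
    return 2e-9 * n**3 + 1e-6          # cubic flops term plus call overhead

# 1) "Measure" the kernel at a few problem sizes.
sizes = np.array([64, 128, 256, 512])
times = np.array([fake_gemm_time(n) for n in sizes])

# 2) Model: fit a cubic runtime model t(n) to the samples.
model = np.polyfit(sizes, times, 3)

def predict(n):
    return float(np.polyval(model, n))

# 3) Predict and rank two hypothetical variants without executing them:
#    variant A performs 8 kernel calls of size n/2, variant B one call of size n.
n = 1024
pred_A = 8 * predict(n // 2)           # extra per-call overhead hurts A
pred_B = predict(n)
ranking = sorted([("A", pred_A), ("B", pred_B)], key=lambda v: v[1])
print("predicted fastest variant:", ranking[0][0])
```

Since 8·(n/2)³ = n³, the flop counts tie, and the prediction correctly attributes variant A's disadvantage to its eightfold call overhead; this is exactly the kind of distinction that pure flop counting misses and model-based ranking captures.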

### Technical Reports

**Large Scale Parallel Computations in R through Elemental**
October 2016.

```bibtex
@techreport{Canales2016:718,
  author = "Rodrigo Canales and Elmar Peise and Paolo Bientinesi",
  title  = "Large Scale Parallel Computations in R through Elemental",
  year   = 2016,
  month  = oct,
  url    = "https://arxiv.org/pdf/1610.07310v1"
}
```

Even though in recent years the scale of statistical analysis problems has increased tremendously, many statistical software tools are still limited to single-node computations. However, statistical analyses are largely based on dense linear algebra operations, which have been deeply studied, optimized and parallelized in the high-performance-computing community. To make high-performance distributed computations available for statistical analysis, and thus enable large-scale statistical computations, we introduce RElem, an open-source package that integrates the distributed dense linear algebra library Elemental into R. While on the one hand, RElem provides direct wrappers of Elemental's routines, on the other hand, it overloads various operators and functions to provide an entirely native R experience for distributed computations. We showcase how simple it is to port existing R programs to RElem and demonstrate that RElem indeed allows one to scale beyond the single-node limitation of R with the full performance of Elemental and without any overhead.

**Cache-aware Performance Modeling and Prediction for Dense Linear Algebra**
November 2014.

```bibtex
@techreport{Peise2014:904,
  author = "Elmar Peise and Paolo Bientinesi",
  title  = "Cache-aware Performance Modeling and Prediction for Dense Linear Algebra",
  year   = 2014,
  month  = nov,
  url    = "http://arxiv.org/pdf/1409.8602v1"
}
```

Countless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation---without executing it---is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, normally one has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms within a few percent of the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes.
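The cache-tracking idea mentioned above can be approximated by a toy model: keep an LRU list of the operands believed to be cache-resident, and classify each kernel access as "warm" (hit) or "cold" (miss) so that the matching timing can be used in a prediction. This is a heavily simplified sketch of the general technique, not the paper's actual tracker, and the class and parameter names are invented for illustration.

```python
from collections import OrderedDict

class CacheTracker:
    """Toy LRU model of which operands are resident in a cache of fixed size."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = OrderedDict()        # operand name -> size in bytes

    def touch(self, name, size_bytes):
        """Record an operand access; return True if it hit the modeled cache."""
        hit = name in self.resident
        if hit:
            self.resident.move_to_end(name)  # refresh LRU position
        else:
            self.resident[name] = size_bytes
            while sum(self.resident.values()) > self.capacity:
                self.resident.popitem(last=False)   # evict least recently used
        return hit

tracker = CacheTracker(capacity_bytes=32 * 1024)
# A and B fit together; loading C evicts A, and reloading A then evicts B.
assert tracker.touch("A", 16 * 1024) is False   # cold
assert tracker.touch("B", 16 * 1024) is False   # cold
assert tracker.touch("C", 16 * 1024) is False   # cold, evicts A
assert tracker.touch("A", 16 * 1024) is False   # cold again, evicts B
assert tracker.touch("C", 16 * 1024) is True    # warm: C is still resident
```

A predictor built on such a model would charge each kernel call the warm or cold timing from its micro-benchmarks, depending on what the tracker says about the call's operands.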