# Talks - Elmar Peise

**Performance Modeling and Prediction for Dense Linear Algebra**RWTH Aachen, November 2017.

PhD Defense.abstractPDFThis dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.**The ELAPS Framework: Experimental Linear Algebra Performance Studies**SIAM Conference on Parallel Processing for Scientific Computing.

Université Pierre et Marie Curie, Paris, April 2016.abstractwebPDF**On the Performance Prediction of BLAS-based Tensor Contractions**5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS14).

SC14, New Orleans, LA, USA, 16 November 2014.abstractwebPDFTensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.**Estimating the Efficiency of BLAS-based Tensor Contractions**Annual Report 2.

AICES, RWTH Aachen, 6 November 2014.**A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels**The Ninth International Workshop on Automatic Performance Tuning (iWAPT2014), VECPAR 2014.

University of Oregon and Hilton Conference Center, Eugene, Oregon, USA, July 2014.abstractwebPDFIt is universally known that caching is critical to attain high-performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.**Algorithms for Large-scale Whole Genome Association Analysis**PBio 2013: International Workshop on Parallelism in Bioinformatics.

EuroMPI 2013, Madrid, September 2013.abstractPDFIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of Gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double buffering technique, while the covariance matrix is kept across the main memory of a distributed memory system. We show that these methods sustain high-performance and allow the analysis of enormous datasets.**Performance Modeling for DLA Kernels**BLIS Retreat.

University of Texas at Austin, September 2013.**High Performance Computational Biology: Asynchronous IO and Elemental for GWAS**Annual Report 1.

AICES, RWTH Aachen, August 2013.**Performance Modeling for Dense Linear Algebra**3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS12).

SC12, Salt Lake City, Utah, USA, November 2012.abstractPDFIt is well known that the behavior of dense linear algebra algorithms is greatly influenced by factors like target architecture, underlying libraries and even problem size; because of this, the accurate prediction of their performance is a real challenge. In this article, we are not interested in creating accurate models for a given algorithm, but in correctly ranking a set of equivalent algorithms according to their performance. Aware of the hierarchical structure of dense linear algebra routines, we approach the problem by developing a framework for the automatic generation of statistical performance models for BLAS and LAPACK libraries. This allows us to obtain predictions through evaluating and combining such models. We demonstrate that our approach is successful in both single- and multi-core environments, not only in the ranking of algorithms but also in tuning their parameters.**Performance Modeling for Ranking Blocked Algorithms**Parallel Matrix Algorithms and Applications (PMAA) 2012.

Birkbeck University of London, June 2012.abstractPDFA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on computing architecture, libraries and problem size, each variant attains a different performances. We propose methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, we identify the routines upon which the algorithms are built. A first tool ,the Sampler, measures the performance of these routines. Using the Sampler, a second tool models their performance. The generated models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow us to correctly rank the algorithms according to their performance without executing them. With the help of the same tools, algorithmic parameters such as block-size can be optimally tuned.**Hierarchical Performance Modeling for Ranking Dense Linear Algebra Algorithms**Master's Thesis colloquium.

AICES, RWTH Aachen, April 2012.abstractPDFA large class of dense linear algebra operations, such as LU decomposition or inversion of a triangular matrix, are usually performed by blocked algorithms. For one such operation, typically, not only one but many algorithmic variants exist; depending on architecture, libraries and problem size, each variant attains a different performances. In my project, I developed methods and tools to rank the algorithmic variants according to their performance for a given scenario without executing them. For this purpose, I analyze the routines upon which the algorithms are built and introduce a tool that, based on measurements, models their performance. The generated performance models are then used to predict the performance of the considered algorithms. For a given scenario, these predictions allow me not only to rank the variants but to determine the optimal algorithm configuration.