# Publications - Paolo Bientinesi

### Submitted Papers

**Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions**Submitted to ACM TOMS, October 2020.abstractwebPDFTensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.**The Linear Algebra Mapping Problem. Current state of linear algebra languages and libraries.**November 2019.abstractwebPDFWe observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and the high-performance communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices, and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations such as common subexpression elimination and loop-invariant code motion, as well as linear algebra specific optimizations such as optimal parenthesization of a matrix product and kernel selection for matrices with properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.

### Journal Articles

**Work-stealing prefix scan: Addressing load imbalance in large-scale image registration**IEEE Transactions on Parallel and Distributed Systems, Volume 33(3), pp. 523-535, 2022.@article{Copik2022:900, author = "Marcin Copik and {Tobias } Grosser and {Torsten } Hoefler and Paolo Bientinesi and Benjamin Berkels", title = "Work-stealing prefix scan: Addressing load imbalance in large-scale image registration", journal = "IEEE Transactions on Parallel and Distributed Systems", year = 2022, volume = 33, number = 3, pages = "523-535", url = "https://arxiv.org/pdf/2010.12478.pdf" }

abstractwebPDFbibtexParallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this paper, we study the recursive registration of a series of electron microscopy images – a time consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over thousand of cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at nanoscale for long microscopy image series.**Linnea: Automatic Generation of Efficient Linear Algebra Programs**ACM Transactions on Mathematical Software (TOMS), Volume 47(3), pp. 1-26, June 2021.@article{Barthels2021:688, author = "Henrik Barthels and Christos Psarras and Paolo Bientinesi", title = "Linnea: Automatic Generation of Efficient Linear Algebra Programs", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2021, volume = 47, number = 3, pages = "1--26", month = jun, url = "https://arxiv.org/pdf/1912.12924.pdf" }

abstractwebPDFbibtexThe translation of linear algebra computations into efficient sequences of library calls is a non-trivial task that requires expertise in both linear algebra and high-performance computing. Almost all high-level languages and libraries for matrix computations (e.g., Matlab, Eigen) internally use optimized kernels such as those provided by BLAS and LAPACK; however, their translation algorithms are often too simplistic and thus lead to a suboptimal use of said kernels, resulting in significant performance losses. In order to combine the productivity offered by high-level languages, and the performance of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem; as output, it returns an efficient sequence of calls to high-performance kernels. Linnea uses a custom best-first search algorithm to find a first solution in less than a second, and increasingly better solutions when given more time. In 125 test problems, the code generated by Linnea almost always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.**Rational spectral filters with optimal convergence rate**SIAM Journal of Scientific Computing, Volume 43(4), pp. A2660-A2684, 2021.@article{Kollnig2021:978, author = "Konrad Kollnig and Paolo Bientinesi and Edoardo {Di Napoli}", title = "Rational spectral filters with optimal convergence rate", journal = "SIAM Journal of Scientific Computing", year = 2021, volume = 43, number = 4, pages = "A2660-A2684", url = "https://arxiv.org/pdf/2001.04184.pdf" }

abstractwebPDFbibtexIn recent years, contour-based eigensolvers have emerged as a standard approach for the solution of large and sparse eigenvalue problems. Building upon recent performance improvements through non-linear least square optimization of so-called rational filters, we introduce a systematic method to design these filters by minimizing the worst-case convergence ratio and eliminate the parametric dependence on weight functions. Further, we provide an efficient way to deal with the box-constraints which play a central role for the use of iterative linear solvers in contour-based eigensolvers. Indeed, these parameter-free filters consistently minimize the number of iterations and the number of FLOPs to reach convergence in the eigensolver. As a byproduct, our rational filters allow for a simple solution to load balancing when the solution of an interior eigenproblem is approached by the slicing of the sought after spectral interval.**Accelerating AIREBO: Navigating the Journey from Complex Legacy Code to High Performance**Journal of Computational Chemistry, Volume 40(14), May 2019.@article{Höhnerbach2019:708, author = "Markus Höhnerbach and Paolo Bientinesi", title = "Accelerating AIREBO: Navigating the Journey from Complex Legacy Code to High Performance", journal = "Journal of Computational Chemistry", year = 2019, volume = 40, number = 14, month = may, url = "https://arxiv.org/pdf/1810.07026.pdf" }

abstractwebPDFbibtexDespite initiatives to improve the quality of scientific codes, there still is a large presence of legacy code. Such code often needs to implement a lot of functionality under time constrains, sacrificing quality. Additionally, quality is rarely improved by optimizations for new architectures. This development model leads to code that is increasingly difficult to work with. Our suggested solution includes complexity-reducing refactoring and hardware abstraction. We focus on the AIREBO potential from LAMMPS, where the challenge is that any potential kernel is rather large and complex, hindering systematic optimization. This issue is common to codes that model multiple physical phenomena. We present our journey from the C++ port of a previous Fortran code to performance-portable, KNC-hybrid, vectorized, scalable, optimized code supporting full and reduced precision. The journey includes extensive testing that fixed bugs in the original code. Large-scale, full-precision runs sustain speedups of more than 4x (KNL) and 3x (Skylake).**Assessment of sound spatialisation algorithms for sonic rendering with headsets**Journal of New Music Research, Volume 48(2), pp. 107-124, March 2019.@article{Tarzan2019:528, author = "Ali Tarzan and Paolo Bientinesi and Marco Alunno", title = "Assessment of sound spatialisation algorithms for sonic rendering with headsets", journal = "Journal of New Music Research ", year = 2019, volume = 48, number = 2, pages = "107-124", month = mar, url = "https://arxiv.org/abs/1711.09234" }

abstractPDFbibtexGiven an input sound signal and a target virtual sound source, sound spatialisation algorithms manipulate the signal so that a listener perceives it as though it were emitted from the target source. There exist several established spatialisation approaches that deliver satisfactory results when loudspeakers are used to playback the manipulated signal. As headphones have a number of desirable characteristics over loudspeakers, such as portability, isolation from the surrounding environment, cost and ease of use, it is interesting to explore how a sense of acoustic space can be conveyed through them. This article first surveys traditional spatialisation approaches intended for loudspeakers, and then reviews them with regard to their adaptability to headphones.**Spin Summations: A High-Performance Perspective**ACM Transactions on Mathematical Software (TOMS), Volume 45(1), March 2019.@article{Springer2019:60, author = "Paul Springer and Devin Matthews and Paolo Bientinesi", title = "Spin Summations: A High-Performance Perspective", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2019, volume = 45, number = 1, month = mar, url = "https://arxiv.org/pdf/1705.06661.pdf" }

abstractwebPDFbibtexBesides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant.**Accelerating scientific codes by performance and accuracy modeling**Journal of Computational Science, April 2018.

Accepted.@article{Fabregat-Traver2018:408, author = "Diego Fabregat-Traver and {Ahmed E.} Ismail and Paolo Bientinesi", title = "Accelerating scientific codes by performance and accuracy modeling ", journal = "Journal of Computational Science", year = 2018, month = apr, note = "Accepted", url = "http://arxiv.org/abs/1608.04694" }

abstractPDFbibtexScientific software is often driven by multiple parameters that affect both accuracy and performance. Since finding the optimal configuration of these parameters is a highly complex task, it extremely common that the software is used suboptimally. In a typical scenario, accuracy requirements are imposed, and attained through suboptimal performance. In this paper, we present a methodology for the automatic selection of parameters for simulation codes, and a corresponding prototype tool. To be amenable to our methodology, the target code must expose the parameters affecting accuracy and performance, and there must be formulas available for error bounds and computational complexity of the underlying methods. As a case study, we consider the particle-particle particle-mesh method (PPPM) from the LAMMPS suite for molecular dynamics, and use our tool to identify configurations of the input parameters that achieve a given accuracy in the shortest execution time. When compared with the configurations suggested by expert users, the parameters selected by our tool yield reductions in the time-to-solution ranging between 10% and 60%. In other words, for the typical scenario where a fixed number of core-hours are granted and simulations of a fixed number of timesteps are to be run, usage of our tool may allow up to twice as many simulations. While we develop our ideas using LAMMPS as computational framework and use the PPPM method for dispersion as case study, the methodology is general and valid for a range of software tools and methods.**The ELAPS Framework: Experimental Linear Algebra Performance Studies**International Journal of High Performance Computing, pp. 1-13, March 2018.@article{Peise2018:560, author = "Elmar Peise and Paolo Bientinesi", title = "The ELAPS Framework: Experimental Linear Algebra Performance Studies", journal = "International Journal of High Performance Computing", year = 2018, pages = "1-13", month = mar, publisher = "Sage", url = "http://arxiv.org/pdf/1504.08035v1" }

abstractwebPDFbibtexIn scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies framework (ELAPS), a multi-platform open-source environment for easy and fast, yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized GUI, and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this paper, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.**Design of a high-performance GEMM-like Tensor-Tensor Multiplication**ACM Transactions on Mathematical Software (TOMS), Volume 44(3), pp. 28:1-28:29, January 2018.@article{Springer2018:554, author = "Paul Springer and Paolo Bientinesi", title = "Design of a high-performance GEMM-like Tensor-Tensor Multiplication", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2018, volume = 44, number = 3, pages = "28:1--28:29", month = jan, url = "https://arxiv.org/pdf/1607.00145.pdf" }

abstractwebPDFbibtexWe present ''GEMM-like Tensor-Tensor multiplication'' (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned ''macro-kernel''. This macro-kernel operates on suitably prepared (''packed'') sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open source ''Tensor Contraction Code Generator'' (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3x. More precisely, GETT achieves speedups of up to 1.42x over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.**Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection**ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 16:1-16:19, September 2017.@article{Peise2017:728, author = "Elmar Peise and Paolo Bientinesi", title = "Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, volume = 44, number = 2, pages = "16:1--16:19", month = sep, publisher = "ACM", address = "New York, NY, USA", url = "http://arxiv.org/pdf/1602.06763v1" }

abstractwebPDFbibtexTo exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve next-to-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms to seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimizes libraries.**TTC: A high-performance Compiler for Tensor Transpositions**ACM Transactions on Mathematical Software (TOMS), Volume 44(2), pp. 15:1-15:21, August 2017.@article{Springer2017:910, author = "Paul Springer and {Jeff R.} Hammond and Paolo Bientinesi", title = "TTC: A high-performance Compiler for Tensor Transpositions", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2017, volume = 44, number = 2, pages = "15:1--15:21", month = aug, publisher = "ACM", url = "http://arxiv.org/pdf/1603.02297v1" }

abstractwebPDFbibtexWe present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.**High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods**Computer Physics Communications, Volume 211, pp. 61 - 72, February 2017.

High Performance Computing for Advanced Modeling and Simulation of Materials.@article{Di_Napoli2017:318, author = "Edoardo {Di Napoli} and Elmar Peise and Markus Hrywniak and Paolo Bientinesi", title = "High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods", journal = "Computer Physics Communications", year = 2017, volume = 211, pages = "61 - 72", month = feb, note = "High Performance Computing for Advanced Modeling and Simulation of Materials", url = "http://arxiv.org/pdf/1602.06589v2" }

abstractwebPDFbibtexOne of the greatest effort of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffers from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees an increased performance and an extended life span of this code enabling larger scale simulations.**Large-Scale Linear Regression: Development of High-Performance Routines**Applied Mathematics and Computation, Volume 275, pp. 411-421, 15 February 2016.@article{Frank2016:90, author = "Alvaro Frank and Diego Fabregat-Traver and Paolo Bientinesi", title = "Large-Scale Linear Regression: Development of High-Performance Routines", journal = "Applied Mathematics and Computation", year = 2016, volume = 275, pages = "411-421", month = feb, url = "http://arxiv.org/abs/1504.07890" }

abstractwebPDFbibtexIn statistics, series of ordinary least squares problems (OLS) are used to study the linear correlation among sets of variables of interest; in many studies, the number of such variables is at least in the millions, and the corresponding datasets occupy terabytes of disk space. As the availability of large-scale datasets increases regularly, so does the challenge in dealing with them. Indeed, traditional solvers---which rely on the use of ``black-box'' routines optimized for one single OLS---are highly inefficient and fail to provide a viable solution for big-data analyses. As a case study, in this paper we consider a linear regression consisting of two-dimensional grids of related OLS problems that arise in the context of genome-wide association analyses, and give a careful walkthrough for the development of {\sc ols-grid}, a high-performance routine for shared-memory architectures; analogous steps are relevant for tailoring OLS solvers to other applications. In particular, we first illustrate the design of efficient algorithms that exploit the structure of the OLS problems and eliminate redundant computations; then, we show how to effectively deal with datasets that do not fit in main memory; finally, we discuss how to cast the computation in terms of efficient kernels and how to achieve scalability. Importantly, each design decision along the way is justified by simple performance models. {\sc ols-grid} enables the solution of $10^{11}$ correlated OLS problems operating on terabytes of data in a matter of hours.**High Performance Solutions for Big-data GWAS**Parallel Computing, Volume 42, pp. 75 - 87, February 2015.

Special issue on Parallelism in Bioinformatics.@article{Peise2015:754, author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi", title = "High Performance Solutions for Big-data GWAS", journal = "Parallel Computing", year = 2015, volume = 42, pages = "75 - 87", month = feb, note = "Special issue on Parallelism in Bioinformatics", url = "http://arxiv.org/pdf/1403.6426v1" }

abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high-performance and allow the analysis of enormous datasets.**Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis**F1000Research, Volume 3(200), August 2014.

Open peer reviews.@article{Fabregat-Traver2014:430, author = "Diego Fabregat-Traver and Sodbo Sharapov and {Caroline } Hayward and {Igor } Rudan and {Harry } Campbell and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Big-Data, High-Performance, Mixed Models Based Genome-Wide Association Analysis", journal = "F1000Research", year = 2014, volume = 3, number = 200, month = aug, note = "Open peer reviews", howpublished = "F1000Research" }

abstractwebbibtexTo raise the power of genome-wide association studies (GWAS) and avoid false-positive results, one can rely on mixed model based tests. When large samples are used, and especially when multiple traits are to be studied in the omics context, this approach becomes computationally unmanageable. Here, we develop new methods for analysis of arbitrary number of traits, and demonstrate that for the analysis of single-trait and multiple-trait GWAS, different methods are optimal. We implement these methods in a high-performance computing framework that uses state-of-the-art linear algebra kernels, incorporates optimizations, and avoids redundant computations, thus increasing throughput while reducing memory usage and energy consumption. We show that compared to existing libraries, the OmicABEL software---which implements these methods---achieves speed-ups of up to three orders of magnitude. As a consequence, samples of tens of thousands of individuals as well as samples phenotyped for many thousands of ''omics'' measurements can be analyzed for association with millions of omics features without the need for super-computers.**Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies**ACM Transactions on Mathematical Software (TOMS), Volume 40(4), pp. 27:1-27:22, June 2014.@article{Fabregat-Traver2014:798, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2014, volume = 40, number = 4, pages = "27:1--27:22", month = jun, publisher = "ACM", url = "http://arxiv.org/pdf/1210.7683v1" }

abstractwebPDFbibtexIn many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by the Genome-Wide Association Studies (GWAS), a widely spread tool in computational biology. GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floating-point operations) over terabytes of data residing on disk. In this paper, we design an efficient algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be distributed among the available cores, to attain high performance and scalability. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.**Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions**Applied Mathematics and Computation, Volume 235, pp. 454-468, May 2014.@article{Di_Napoli2014:210, author = "Edoardo {Di Napoli} and Diego Fabregat-Traver and Gregorio Quintana-Orti and Paolo Bientinesi", title = "Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions", journal = "Applied Mathematics and Computation", year = 2014, volume = 235, pages = "454--468", month = may, publisher = "Elsevier", url = "http://arxiv.org/pdf/1307.2100" }

abstractwebPDFbibtexMathematical operators whose transformation rules constitute the building blocks of a multi-linear algebra are widely used in physics and engineering applications where they are very often represented as tensors. In the last century, thanks to the advances in tensor calculus, it was possible to uncover new research fields and make remarkable progress in the existing ones, from electromagnetism to the dynamics of fluids and from the mechanics of rigid bodies to quantum mechanics of many atoms. By now, the formal mathematical and geometrical properties of tensors are well defined and understood; conversely, in the context of scientific and high-performance computing, many tensor-related problems are still open. In this paper, we address the problem of efficiently computing contractions among two tensors of arbitrary dimension by using kernels from the highly optimized BLAS library. In particular, we establish precise conditions to determine if and when GEMM, the kernel for matrix products, can be used. Such conditions take into consideration both the nature of the operation and the storage scheme of the tensors, and induce a classification of the contractions into three groups. For each group, we provide a recipe to guide the users towards the most effective use of BLAS.**Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures**Applied Mathematics and Computation, Volume 234, pp. 606-617, May 2014.@article{Fabregat-Traver2014:430, author = "Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures", journal = "Applied Mathematics and Computation", year = 2014, volume = 234, pages = "606--617", month = may, url = "http://arxiv.org/pdf/1210.7325v1" }

abstractwebPDFbibtexGeneralized linear mixed-effects models in the context of genome-wide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized least-squares problems, and the processing of terabytes of data. We present high performance in-core and out-of-core shared-memory algorithms for GWAS: By taking advantage of domain-specific knowledge, exploiting multi-core parallelism, and handling data efficiently, our algorithms attain unequalled performance. When compared to GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor we obtain 50-fold speedups. As a consequence, our routines enable genome studies of unprecedented size.**Improved Accuracy and Parallelism for MRRR-based Eigensolvers**SIAM Journal of Scientific Computing, Volume 36(2), pp. 240-263, April 2014.@article{Petschow2014:60, author = "Matthias Petschow and {Enrique S.} Quintana-Orti and Paolo Bientinesi", title = "Improved Accuracy and Parallelism for MRRR-based Eigensolvers", journal = "SIAM Journal of Scientific Computing", year = 2014, volume = 36, number = 2, pages = "240--263", month = apr, url = "http://arxiv.org/pdf/1304.1864v3" }

abstractwebPDFbibtexThe real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based on a reduction to real tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR) is among the fastest methods. Although fast, the solvers based on MRRR do not deliver the same accuracy as competing methods like Divide & Conquer or the QR algorithm. In this paper, we demonstrate that the use of mixed precisions leads to improved accuracy of MRRR-based eigensolvers with limited or no performance penalty. As a result, we obtain eigensolvers that are not only equally or more accurate than the best available methods, but also - under most circumstances - faster and more scalable than the competition.**Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials**Journal of Chemical Physics, Volume 140, pp. 024105, January 2014.@article{Tameling2014:590, author = "Daniel Tameling and Paul Springer and Paolo Bientinesi and {Ahmed E.} Ismail", title = "Multilevel Summation for Dispersion: A Linear-Time Algorithm for 1/r^6 Potentials", journal = "Journal of Chemical Physics", year = 2014, volume = 140, pages = 24105, month = jan, url = "https://arxiv.org/pdf/1308.4005.pdf" }

abstractwebPDFbibtexThe multilevel summation (MLS) method was developed to evaluate long-range interactions in molecular dynamics (MD) simulations. MLS was initially introduced for Coulombic potentials; we have extended this method to dispersion interactions. While formally short-ranged, for an accurate calculation of forces and energies in cases such as in interfacial systems, dispersion potentials require long-range methods. Since long-range solvers tend to dominate the time needed to perform MD calculations, increasing their performance is of vital importance. The MLS method offers some significant advantages when compared to mesh-based Ewald methods like the particle-particle particle-mesh and particle mesh Ewald methods. Unlike mesh-based Ewald methods, MLS does not use fast Fourier transforms and is thus not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared to the O(N log N) complexity of the mesh-based Ewald methods. While the structure of the MLS method is invariant for different potentials, every algorithmic step had to be adapted to accommodate the 1/r^6 form of the dispersion interactions. In addition, we have derived error bounds, similar to those obtained by Hardy for the electrostatic MLS. Using a prototype implementation, we can already demonstrate the linear scaling of the MLS method for dispersion, and present results establishing the accuracy and efficiency of the method.**Application-tailored Linear Algebra Algorithms: A Search-Based Approach**International Journal of High Performance Computing Applications (IJHPCA), Volume 27(4), pp. 425 - 438, November 2013.@article{Fabregat-Traver2013:4, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Application-tailored Linear Algebra Algorithms: A Search-Based Approach", journal = "International Journal of High Performance Computing Applications (IJHPCA)", year = 2013, volume = 27, number = 4, pages = "425 - 438", month = nov, publisher = "Sage Publications, Inc.", url = "https://arxiv.org/pdf/1211.5904.pdf" }

abstractwebPDFbibtexIn this paper, we tackle the problem of automatically generating algorithms for linear algebra operations by taking advantage of problem-specific knowledge. In most situations, users possess much more information about the problem at hand than what current libraries and computing environments accept; evidence shows that if properly exploited, such information leads to uncommon/unexpected speedups. We introduce a knowledge-aware linear algebra compiler that allows users to input matrix equations together with properties about the operands and the problem itself; for instance, they can specify that the equation is part of a sequence, and how successive instances are related to one another. The compiler exploits all this information to guide the generation of algorithms, to limit the size of the search space, and to avoid redundant computations. We applied the compiler to equations arising as part of sensitivity and genome studies; the algorithms produced exhibit, respectively, 100- and 1000-fold speedups.**Dissecting the FEAST Algorithm for Generalized Eigenproblems**Journal of Computational and Applied Mathematics, Volume 244, pp. 1-9, May 2013.@article{Kraemer2013:188, author = "Lukas Kraemer and Edoardo {Di Napoli} and {Martin } Galgon and Bruno Lang and Paolo Bientinesi", title = "Dissecting the FEAST Algorithm for Generalized Eigenproblems", journal = "Journal of Computational and Applied Mathematics", year = 2013, volume = 244, pages = "1--9", month = may, url = "http://arxiv.org/abs/1204.1726" }

abstractwebPDFbibtexWe analyze the FEAST method for computing selected eigenvalues and eigenvectors of large sparse matrix pencils. After establishing the close connection between FEAST and the well-known Rayleigh-Ritz method, we identify several critical issues that influence convergence and accuracy of the solver: the choice of the starting vector space, the stopping criterion, how the inner linear systems impact the quality of the solution, and the use of FEAST for computing eigenpairs from multiple intervals. We complement the study with numerical examples, and hint at possible improvements to overcome the existing problems.**High-Performance Solvers for Dense Hermitian Eigenproblems**SIAM Journal on Scientific Computing, Volume 35(1), pp. C1-C22, January 2013.@article{Petschow2013:208, author = "Matthias Petschow and Elmar Peise and Paolo Bientinesi", title = "High-Performance Solvers for Dense Hermitian Eigenproblems", journal = "SIAM Journal on Scientific Computing", year = 2013, volume = 35, number = 1, pages = "C1-C22", month = jan, publisher = "Society for Industrial and Applied Mathematics", url = "http://arxiv.org/pdf/1205.2107.pdf" }

abstractwebPDFbibtexWe introduce a new collection of solvers - subsequently called EleMRRR - for large-scale dense Hermitian eigenproblems. EleMRRR solves various types of problems: generalized, standard, and tridiagonal eigenproblems. Among these, the last is of particular importance as it is a solver on its own right, as well as the computational kernel for the first two; we present a fast and scalable tridiagonal solver based on the Algorithm of Multiple Relatively Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers, PMRRR is part of the freely available Elemental library, and is designed to fully support both message-passing (MPI) and multithreading parallelism (SMP). As a result, the solvers can equally be used in pure MPI or in hybrid MPI-SMP fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's solvers on two supercomputers. Such a study, performed with up to 8,192 cores, provides precise guidelines to assemble the fastest solver within the ScaLAPACK framework; it also indicates that EleMRRR outperforms even the fastest solvers built from ScaLAPACK's components.**Modeling Performance through Memory-Stalls**ACM SIGMETRICS Performance Evaluation Review, Volume 40(2), pp. 86-91, October 2012.@article{Iakymchuk2012:120, author = "Roman Iakymchuk and Paolo Bientinesi", title = "Modeling Performance through Memory-Stalls", journal = "ACM SIGMETRICS Performance Evaluation Review", year = 2012, volume = 40, number = 2, pages = "86--91", month = oct, publisher = "ACM", address = "New York, NY, USA", url = "http://hpac.cs.umu.se/~pauldj/pubs/pmbs.pdf" }

abstractwebPDFbibtexWe aim at modeling the performance of linear algebra algorithms without executing either the algorithms or any parts of them. The performance of an algorithm can be expressed in terms of the time spent on CPU execution and memory-stalls. The main concern of this paper is to build analytical models to accurately predict memory-stalls. The scenario in which data resides in the L2 cache is considered; with this assumption, only L1 cache misses occur. We construct an analytical formula for modeling the L1 cache misses of fundamental linear algebra operations such as those included in the Basic Linear Algebra Subprograms (BLAS) library. The number of cache misses occurring in higher-level algorithms--like a matrix factorization--is then predicted by combining the models for the appropriate BLAS subroutines. As case studies, we consider the LU factorization and GER--a BLAS operation and a building block for the LU factorization. We validate the models on both Intel and AMD processors, attaining remarkably accurate performance predictions.**Correlations in Sequences of Generalized Eigenproblems Arising in Density Functional Theory**Computer Physics Communications (CPC), Volume 183(8), pp. 1674-1682, August 2012.@article{Di_Napoli2012:160, author = "Edoardo {Di Napoli} and Stefan Bluegel and Paolo Bientinesi", title = "Correlations in Sequences of Generalized Eigenproblems Arising in Density Functional Theory", journal = "Computer Physics Communications (CPC)", year = 2012, volume = 183, number = 8, pages = "1674-1682", month = aug, url = "http://arxiv.org/pdf/1108.2594v1" }

abstractwebPDFbibtexDensity Functional Theory (DFT) is one of the most used {\em ab initio} theoretical frameworks in materials science. It derives the ground state properties of multi-atomic ensembles directly from the computation of their one-particle density \nr . In DFT-based simulations the solution is calculated through a chain of successive self-consistent cycles; in each cycle a series of coupled equations (Kohn-Sham) translates to a large number of generalized eigenvalue problems whose eigenpairs are the principal means for expressing \nr. A simulation ends when \nr\ has converged to the solution within the required numerical accuracy. This usually happens after several cycles, resulting in a process calling for the solution of many sequences of eigenproblems. In this paper, the authors report evidence showing unexpected correlations between adjacent eigenproblems within each sequence and suggest the investigation of an alternative computational approach: information extracted from the simulation at one step of the sequence is used to compute the solution at the next step. The implications are multiple: from increasing the performance of material simulations, to the development of a mixed direct-iterative solver, to modifying the mathematical foundations of the DFT computational paradigm in use, thus opening the way to the investigation of new materials.**Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures**Applied Mathematics and Computation, Volume 218(22), pp. 11279-11289, July 2012.@article{Aliaga2012:420, author = "Jose' Aliaga and Paolo Bientinesi and Davor Davidovic and Edoardo {Di Napoli} and Francisco Igual and {Enrique S.} Quintana-Orti", title = "Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures", journal = "Applied Mathematics and Computation", year = 2012, volume = 218, number = 22, pages = "11279-11289", month = jul, url = "http://arxiv.org/pdf/1111.6374v1" }

abstractwebPDFbibtexWe compare two approaches to compute a portion of the spectrum of dense symmetric definite generalized eigenproblems: one is based on the reduction to tridiagonal form, and the other on the Krylov-subspace iteration. Two large-scale applications, arising in molecular dynamics and materials science, are employed to investigate the contributions of the application, architecture, and parallelism of the method to the performance of the solvers. The experimental results on a state-of-the-art 8-core platform, equipped with a graphics processing unit (GPU), reveal that in real applications, iterative Krylov-subspace methods can be a competitive approach also for the solution of dense problems.**Deriving Dense Linear Algebra Libraries**Formal Aspects of Computing, pp. 1-13, January 2012.@article{Bientinesi2012:978, author = "Paolo Bientinesi and {John A.} Gunnels and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Tyler Rhodes and Robert {van de Geijn} and Field {Van Zee}", title = "Deriving Dense Linear Algebra Libraries", journal = "Formal Aspects of Computing", year = 2012, pages = "1--13", month = jan, url = "http://hpac.cs.umu.se/~pauldj/pubs/FormAspComp.pdf" }

abstractwebPDFbibtexStarting in the late 1960s computer scientists including Dijkstra and Hoare advocated goal-oriented program- ming and the formal derivation of algorithms. The chief impediment to realizing this for loop-based programs was that a priori determination of loop-invariants, a prerequisite for developing loops, was a task too complex for any but the simplest of operations. Around 2000, these techniques were for the first time successfully applied to the domain of high-performance dense linear algebra libraries. This has led to a multitude of papers, mostly published in the ACM Transactions for Mathematical Software, as well as a system for the mechanical derivation of algorithms and a high-performance linear algebra library, libflame, that includes more than a thousand vari- ants of algorithms for more than a hundred linear algebra operations. To our knowledge, this success story has unfolded with limited awareness on the part the formal methods community. This paper reports on ten years of experience and is meant to raise that awareness.**MR3-SMP: A Symmetric Tridiagonal Eigensolver for Multi-Core Architectures**Parallel Computing, Volume 37(12), December 2011.@article{Petschow2011:254, author = "Matthias Petschow and Paolo Bientinesi", title = "MR3-SMP: A Symmetric Tridiagonal Eigensolver for Multi-Core Architectures", journal = "Parallel Computing", year = 2011, volume = 37, number = 12, month = dec, url = "http://hpac.cs.umu.se/~pauldj/pubs/MR3-SMP-TR.pdf" }

abstractwebPDFbibtexThe computation of eigenvalues and eigenvectors of symmetric tridiagonal matrices arises frequently in applications; often as one of the steps in the solution of Hermitian and symmetric eigenproblems. While several accurate and efficient methods for the tridiagonal eigenproblem exist, their corresponding implementations usually target uni-processors or large distributed memory systems. Our new eigensolver MR3-SMP is instead specifically designed for multi-core and many-core general purpose processors, which today have effectively replaced uni-processors. We show that in most cases MR3-SMP is faster and achieves better speedups than state-of-the-art eigensolvers for uni-processors and distributed-memory systems.**Condensed Forms for the Symmetric Eigenvalue Problem on Multi-threaded Architectures**Concurrency and Computation: Practice and Experience, Volume 23(7), May 2011.@article{Bientinesi2011:754, author = "Paolo Bientinesi and Francisco Igual and Daniel Kressner and Matthias Petschow and {Enrique S.} Quintana-Orti", title = "Condensed Forms for the Symmetric Eigenvalue Problem on Multi-threaded Architectures", journal = "Concurrency and Computation: Practice and Experience", year = 2011, volume = 23, number = 7, month = may, url = "http://hpac.cs.umu.se/~pauldj/pubs/red-multi-TR.pdf" }

abstractwebPDFbibtexWe investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on general-purpose multi-core processors. In response to the advances of hardware accelerators, we also modify the code in the SBR toolbox to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). The performance results illustrate the parallelism and scalability of these algorithms on current high-performance multi-core and many-core architectures.**Goal-Oriented and Modular Stability Analysis**SIAM Journal on Matrix Analysis and Applications, Volume 32(1), pp. 286 - 308, March 2011.@article{Bientinesi2011:810, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Goal-Oriented and Modular Stability Analysis", journal = "SIAM Journal on Matrix Analysis and Applications", year = 2011, volume = 32, number = 1, pages = "286 -- 308", month = mar, url = "http://hpac.cs.umu.se/~pauldj/pubs/Stability-74105.pdf" }

abstractwebPDFbibtexWe introduce a methodology for obtaining inventories of error results for families of numerical dense linear algebra algorithms. The approach for deriving the analyses is goal-oriented, systematic, and layered. The presentation places the analysis side-by-side with the algorithm so that it is obvious where roundoff error is introduced. The approach supports the analysis of more complex algorithms, such as the blocked LU factorization. For this operation we derive a tighter error bound than has been previously reported in the literature.**Sparse Direct Factorizations through Unassembled Hyper-Matrices**Computer Methods in Applied Mechanics and Engineering, Volume 199(9), pp. 430-438, January 2010.@article{Bientinesi2010:700, author = "Paolo Bientinesi and Victor Eijkhout and Kyungjoo Kim and Jason Kurtz and Robert {van de Geijn}", title = "Sparse Direct Factorizations through Unassembled Hyper-Matrices", journal = "Computer Methods in Applied Mechanics and Engineering", year = 2010, volume = 199, number = 9, pages = "430--438", month = jan, url = "http://hpac.cs.umu.se/~pauldj/pubs/UHM-TR-09.pdf" }

abstractwebPDFbibtexWe set out to efficiently compute the solution of a sequence of linear systems Aixi = bi, where the matrix Ai is tightly related to matrix Ai-1. In the setting of an hp-adaptive Finite Element Method, the sequence of matrices Ai results from successive local refinements of the problem domain. At any step i > 1, a factorization already exists and it is the updated linear system relative to the refined mesh for which a factorization must be computed in the least amount of time. This observation holds the promise of a tremendous reduction in the cost of an individual refinement step. We argue that traditional matrix storage schemes, whether dense or sparse, are a bottleneck, limiting the potential efficiency of the solvers. We propose a new hierarchical data structure, the Unassembled Hyper-Matrix (UHM), which allows the matrix to be stored as a tree of unassembled element matrices, hierarchically ordered to mirror the refinement history of the physical domain. The factorization of such an UHM proceeds in terms of element matrices, only assembling nodes when they need to be eliminated. Efficiency comes in terms of both performance and space requirements.**Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix**ACM Transactions on Mathematical Software, Volume 35(1), July 2008.@article{Bientinesi2008:648, author = "Paolo Bientinesi and Brian Gunter and Robert {van de Geijn}", title = "Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix", journal = "ACM Transactions on Mathematical Software", year = 2008, volume = 35, number = 1, month = jul, url = "http://hpac.cs.umu.se/~pauldj/pubs/SPDInv.pdf" }

abstractwebPDFbibtexWe study the high-performance implementation of the inversion of a Symmetric Positive Definite (SPD) matrix on architectures ranging from sequential processors to Symmetric MultiProcessors to distributed memory parallel computers. This inversion is traditionally accomplished in three Ã¢??sweepsÃ¢??: a Cholesky factorization of the SPD matrix, the inversion of the resulting triangular matrix, and finally the multiplication of the inverted triangular matrix by its own transpose. We state different algorithms for each of these sweeps as well as algorithms that compute the result in a single sweep. One algorithm outperforms the current ScaLAPACK implementation by 20-30 percent due to improved load-balance on a distributed memory architecture.**Scalable Parallelization of FLAME Code via the Workqueuing Model**ACM Transactions on Mathematical Software, Volume 34(2), March 2008.@article{Van_Zee2008:598, author = "Field {Van Zee} and Paolo Bientinesi and Tze {Meng Low} and Robert {van de Geijn}", title = "Scalable Parallelization of FLAME Code via the Workqueuing Model", journal = "ACM Transactions on Mathematical Software", year = 2008, volume = 34, number = 2, month = mar }

abstractwebbibtexWe discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.**Representing Linear Algebra Algorithms in Code: The FLAME APIs**ACM Transactions on Mathematical Software, Volume 31(1), March 2005.@article{Bientinesi2005:504, author = "Paolo Bientinesi and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "Representing Linear Algebra Algorithms in Code: The FLAME APIs", journal = "ACM Transactions on Mathematical Software", year = 2005, volume = 31, number = 1, month = mar, url = "http://hpac.cs.umu.se/~pauldj/pubs/FLAME-API.pdf" }

abstractwebPDFbibtexIn this article, we present a number of Application Program Interfaces (APIs) for coding linear algebra algorithms. On the surface, these APIs for the MATLAB M-script and C programming languages appear to be simple, almost trivial, extensions of those languages. Yet with them, the task of programming and maintaining families of algorithms for a broad spectrum of linear algebra operations is greatly simplified. In combination with our Formal Linear Algebra Methods Environment (FLAME) approach to deriving such families of algorithms, dozens of algorithms for a single linear algebra operation can be derived, verified to be correct, implemented, and tested, often in a matter of minutes per algorithm. Since the algorithms are expressed in code much like they are explained in a classroom setting, these APIs become not just a tool for implementing libraries, but also a valuable tool for teaching the algorithms that are incorporated in the libraries. In combination with an extension of the Parallel Linear Algebra Package (PLAPACK) API, the approach presents a migratory path from algorithm to MATLAB implementation to high-performance sequential implementation to parallel implementation. Finally, the APIs are being used to create a repository of algorithms and implementations for linear algebra operations, the FLAME Interface REpository (FIRE), which already features hundreds of algorithms for dozens of commonly encountered linear algebra operations.**The Science of Deriving Dense Linear Algebra Algorithms**ACM Transactions on Mathematical Software, Volume 31(1), March 2005.@article{Bientinesi2005:460, author = "Paolo Bientinesi and {John A.} Gunnels and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "The Science of Deriving Dense Linear Algebra Algorithms", journal = "ACM Transactions on Mathematical Software", year = 2005, volume = 31, number = 1, month = mar, url = "http://hpac.cs.umu.se/~pauldj/pubs/Recipe.ps" }

abstractwebPDFbibtexIn this article we present a systematic approach to the derivation of families of high-performance algorithms for a large set of frequently encountered dense linear algebra operations. As part of the derivation a constructive proof of the correctness of the algorithm is generated. The article is structured so that it can be used as a tutorial for novices. However, the method has been shown to yield new high-performance algorithms for well-studied linear algebra operations and should also be of interest to those who wish to produce best-in-class high-performance codes.**A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations**SIAM Journal on Scientific Computing, Volume 27(1), 2005.@article{Bientinesi2005:550, author = "Paolo Bientinesi and Inderjit Dhillon and Robert {van de Geijn}", title = "A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations", journal = "SIAM Journal on Scientific Computing", year = 2005, volume = 27, number = 1, url = "http://hpac.cs.umu.se/~pauldj/pubs/PMR3.pdf" }

abstractwebPDFbibtexWe present a new parallel algorithm for the dense symmetric eigenvalue/eigenvector problem that is based upon the tridiagonal eigensolver, Algorithm MR3, recently developed by Dhillon and Parlett. Algorithm MR3 has a complexity of O(n2) operations for computing all eigenvalues and eigenvectors of a symmetric tridiagonal problem. Moreover the algorithm requires only O(n) extra workspace and can be adapted to compute any subset of k eigenpairs in O(nk) time. In contrast, all earlier stable parallel algorithms for the tridiagonal eigenproblem require O(n3) operations in the worst case, while some implementations, such as divide and conquer, have an extra O(n2) memory requirement. The proposed parallel algorithm balances the workload equally among the processors by traversing a matrix-dependent representation tree which captures the sequence of computations performed by Algorithm MR3. The resulting implementation allows problems of very large size to be solved efficiently---the largest dense eigenproblem solved in-core on a 256 processor machine with 2 GBytes of memory per processor is for a matrix of size 128,000 x 128,000, which required about 8 hours of CPU time. We present comparisons with other eigensolvers and results on matrices that arise in the applications of computational quantum chemistry and finite element modeling of automobile bodies.**On Numerical Approximation of Electrostatic Energy in 3D**Journal of Computational Physics, Volume 146(2), pp. 707-725, November 1998.@article{Finocchiaro1998:418, author = "Daniele Finocchiaro and Marco Pellegrini and Paolo Bientinesi", title = "On Numerical Approximation of Electrostatic Energy in 3D", journal = "Journal of Computational Physics", year = 1998, volume = 146, number = 2, pages = "707-725", month = nov, url = "http://hpac.cs.umu.se/~pauldj/pubs/jcp-final.pdf" }

abstractwebPDFbibtexApproximating the Coulomb self-energy of a charge distribution within a three-dimensional domain and the mutual Coulomb energy of two charge distributions often constitutes a computational bottleneck in the simulation of physical systems. The present article reports on a recently developed computational technique aimed at the numerical evaluation of the six-dimensional integrals arising from Coulomb interactions. Techniques from integral geometry are used to show a reduction of the domain from six-dimensional to two-dimensional. In the process analytic singularities due to Coulomb's law are eliminated. Experimental results on the self-energy of a charged cube show that the proposed method converges rapidly and is competitive with methods proposed in the literature for similar integration problems.

### Book Chapter

**HPC on Competitive Cloud Resources**In Handbook of Cloud Computing, pp. 493-516, Springer, 2010.PDFbibtex@inbook{Bientinesi2010:54, author = "Paolo Bientinesi and Roman Iakymchuk and Jeff Napper", title = "HPC on Competitive Cloud Resources", pages = "493--516", publisher = "Springer", year = 2010, editor = "Borko Furht and Armando Escalante", booktitle = "In Handbook of Cloud Computing", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University", url = "http://hpac.cs.umu.se/~iakymchuk/pub/AICES-2010-06-01.pdf" }

### Peer Reviewed Conference Publications

**Performance Comparison for Scientific Computations on the Edge via Relative Performance**Proceedings of the 3rd Workshop on Parallel AI and Systems for the Edge (PAISE 2021), March 2021.@inproceedings{_Sankaran2021:48, author = "Aravind Sankaran and Paolo Bientinesi", title = "Performance Comparison for Scientific Computations on the Edge via Relative Performance", year = 2021, month = mar, url = "https://arxiv.org/pdf/2102.12740.pdf" }

abstractwebPDFbibtexIn a typical Internet-of-Things setting that involves scientific applications, a target computation can be evaluated in many different ways depending on the split of computations among various devices. On the one hand, different implementations (or algorithms)---equivalent from a mathematical perspective---might exhibit significant difference in terms of performance. On the other hand, some of the implementations are likely to show similar performance characteristics. In this paper, we focus on analysing the performance of a given set of algorithms by clustering them into performance classes. To this end, we use a measurement-based approach to evaluate and score algorithms based on pair-wise comparisons; we refer to this approach as ``Relative performance analysis". Each comparison yields one of three outcomes: one algorithm can be ``better", ``worse", or ``equivalent" to another; those algorithms evaluating to have ``equivalent'' performance are merged into the same performance class. We show that our clustering methodology facilitates algorithm selection with respect to more than one metric; for instance, from the subset of equivalently fast algorithms, one could then select an algorithm that consumes the least energy on a certain device.**Automatic Generation of Efficient Linear Algebra Programs**Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20), June 2020.@inproceedings{Barthels2020:200, author = "Henrik Barthels and Christos Psarras and Paolo Bientinesi", title = "Automatic Generation of Efficient Linear Algebra Programs", booktitle = "Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20)", year = 2020, month = jun, journal = "CoRR" }

abstractwebbibtexThe level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter expresses the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a task, far from trivial, that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages, and make use of the efficiency of low level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.**A Timer-Augmented Cost Function for Load Balanced DSMC**13th International Meeting on High Performance Computing for Computational Science (VECPAR 18), Lecture Notes in Computer Science, Volume 11333, September 2019.@inproceedings{McDoniel2019:488, author = "William McDoniel and Paolo Bientinesi", title = "A Timer-Augmented Cost Function for Load Balanced DSMC", booktitle = "13th International Meeting on High Performance Computing for Computational Science (VECPAR 18)", year = 2019, volume = 11333, series = "Lecture Notes in Computer Science", month = sep, url = "https://arxiv.org/pdf/1902.06040.pdf" }

abstractPDFbibtexDue to a hard dependency between time steps, large-scale simulations of gas using the Direct Simulation Monte Carlo (DSMC) method proceed at the pace of the slowest processor. Scalability is therefore achievable only by ensuring that the work done each time step is as evenly apportioned among the processors as possible. Furthermore, as the simulated system evolves, the load shifts, and thus this load-balancing typically needs to be performed multiple times over the course of a simulation. Common methods generally use either crude performance models or processor-level timers. We combine both to create a timer-augmented cost function which both converges quickly and yields well-balanced processor decompositions. When compared to a particle-based performance model alone, our method achieves 2x speedup at steady-state on up to 1024 processors for a test case consisting of a Mach 9 argon jet impacting a solid wall.**Code Generation in Linnea**Proceedings of the 6th International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2019), 22 June 2019.

Extended abstract.@inproceedings{Barthels2019:800, author = "Henrik Barthels and Paolo Bientinesi", title = "Code Generation in Linnea", year = 2019, address = "Phoenix, Arizona", month = jun, note = "Extended abstract", url = "http://hpac.cs.umu.se/~barthels/publications/ARRAY_2019.pdf" }

abstractPDFbibtexLinnea is a code generator for the translation of high-level linear algebra problems to efficient code. Unlike other languages and libraries for linear algebra, Linnea heavily relies on domain-specific knowledge to rewrite expressions and infer matrix properties. Here we focus on two aspects related to code generation and matrix properties: 1) The automatic generation of code consisting of explicit calls to BLAS and LAPACK kernels, and the corresponding challenge with specialized storage formats. 2) A general notion of banded matrices can be used to simplify the inference of many matrix properties. While it is crucial to make use of matrix properties to achieve high performance, inferring those properties is challenging. We show how matrix bandwidth can be used as a unifying language to reason about many common matrix properties.**Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition**Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, IEEE Press, April 2018.@inproceedings{Raissi2018:320, author = "{Tina } Raissi and Alessandro Tibo and Paolo Bientinesi", title = "Extended Pipeline for Content-Based Feature Engineering in Music Genre Recognition", year = 2018, month = apr, publisher = "IEEE Press", url = "https://arxiv.org/pdf/1805.05324.pdf" }

abstractwebPDFbibtexWe present a feature engineering pipeline for the construction of musical signal characteristics, to be used for the design of a supervised model for musical genre identification. The key idea is to extend the traditional two-step process of extraction and classification with additive stand-alone phases which are no longer organized in a waterfall scheme. The whole system is realized by traversing backtrack arrows and cycles between various stages. In order to give a compact and effective repre- sentation of the features, the standard early temporal integra- tion is combined with other selection and extraction phases: on the one hand, the selection of the most meaningful char- acteristics based on information gain, on the other hand, the inclusion of the nonlinear correlation between this subset of features, determined by an autoencoder. The results of the experiments conducted on GTZAN dataset reveal a noticeable contribution of this methodology towards the model’s perfor- mance in classification task.**The Generalized Matrix Chain Algorithm**Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 11, 24 February 2018.@inproceedings{Barthels2018:130, author = "Henrik Barthels and Marcin Copik and Paolo Bientinesi", title = "The Generalized Matrix Chain Algorithm", booktitle = "Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and Optimization", year = 2018, pages = 11, address = "Vienna, Austria", month = feb, url = "https://arxiv.org/pdf/1804.04021.pdf" }

abstractwebPDFbibtexIn this paper, we present a generalized version of the matrix chain algorithm to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of works. The standard matrix chain problem consists in finding the parenthesization of a matrix product $M := A_1 A_2 \cdots A_n$ that minimizes the number of scalar operations. In practical applications, however, one frequently encounters more complicated expressions, involving transposition, inversion, and matrix properties. Indeed, the computation of such expressions relies on a set of computational kernels that offer functionality well beyond the simple matrix product. The challenge then shifts from finding an optimal parenthesization to finding an optimal mapping of the input expression to the available kernels. Furthermore, it is often the case that a solution based on the minimization of scalar operations does not result in the optimal solution in terms of execution time. In our experiments, the generated code outperforms other libraries and languages on average by a factor of about 5. The motivation for this work comes from the fact that---despite great advances in the development of compilers---the task of mapping linear algebra problems to optimized kernels is still to be done manually. In order to relieve the user from this complex task, new techniques for the compilation of linear algebra expressions have to be developed.**Program Generation for Small-Scale Linear Algebra Applications**Proceedings of the International Symposium on Code Generation and Optimization, February 2018.@inproceedings{Spampinato2018:858, author = "{Daniele } Spampinato and Diego Fabregat-Traver and Paolo Bientinesi and Markus Pueschel", title = "Program Generation for Small-Scale Linear Algebra Applications", year = 2018, address = "Vienna, Austria", month = feb, url = "https://arxiv.org/pdf/1805.04775.pdf" }

abstractwebPDFbibtexWe present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications, higher level operations including linear systems solvers, Cholesky and LU factorizations, as well as loops. The output of SLinGen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with icc/clang or a polyhedral optimizer, as well as library-based and template-based implementations.**Efficient Pattern Matching in Python**Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing, November 2017.

In conjunction with SC17: The International Conference for High Performance Computing, Networking, Storage and Analysis.@inproceedings{Krebber2017:404, author = "Manuel Krebber and Henrik Barthels and Paolo Bientinesi", title = "Efficient Pattern Matching in Python", booktitle = "Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing", year = 2017, address = "Denver, Colorado", month = nov, note = "In conjunction with SC17: The International Conference for High Performance Computing, Networking, Storage and Analysis", url = "https://arxiv.org/pdf/1710.00077.pdf" }

abstractwebPDFbibtexPattern matching is a powerful tool for symbolic computations. Applications include term rewriting systems, as well as the manipulation of symbolic expressions, abstract syntax trees, and XML and JSON data. It also allows for an intuitive description of algorithms in the form of rewrite rules. We present the open source Python module MatchPy, which offers functionality and expressiveness similar to the pattern matching in Mathematica. In particular, it includes syntactic pattern matching, as well as matching for commutative and/or associative functions, sequence variables, and matching with constraints. MatchPy uses new and improved algorithms to efficiently find matches for large pattern sets by exploiting similarities between patterns. The performance of MatchPy is investigated on several real-world problems.**Linnea: Compiling Linear Algebra Expressions to High-Performance Code**Proceedings of the 8th International Workshop on Parallel Symbolic Computation, July 2017.@inproceedings{Barthels2017:254, author = "Henrik Barthels and Paolo Bientinesi", title = "Linnea: Compiling Linear Algebra Expressions to High-Performance Code", booktitle = "Proceedings of the 8th International Workshop on Parallel Symbolic Computation", year = 2017, address = "Kaiserslautern, Germany", month = jul }

abstractwebbibtexLinear algebra expressions appear in fields as diverse as computational biology, signal processing, communication technology, finite element methods, and control theory. Libraries such as BLAS and LAPACK provide highly optimized building blocks for just about any linear algebra computation; thus, a linear algebra expression can be evaluated efficiently by breaking it down into those building blocks. However, this is a challenging problem, requiring knowledge in high-performance computing, compilers, and numerical linear algebra. In this paper we give an overview of existing solutions, and introduce Linnea, a compiler that solves this problem. As shown through a set of test cases, Linnea’s results are comparable with those obtained by human experts.**MatchPy: A Pattern Matching Library**Proceedings of the 15th Python in Science Conference, July 2017.@inproceedings{Krebber2017:550, author = "Manuel Krebber and Henrik Barthels and Paolo Bientinesi", title = "MatchPy: A Pattern Matching Library", booktitle = "Proceedings of the 15th Python in Science Conference", year = 2017, address = "Austin, Texas", month = jul, url = "https://arxiv.org/pdf/1710.06915.pdf" }

abstractwebPDFbibtexPattern matching is a powerful tool for symbolic computations, based on the well-defined theory of term rewriting systems. Application domains include algebraic expressions, abstract syntax trees, and XML and JSON data. Unfortunately, no lightweight implementation of pattern matching as general and flexible as Mathematica exists for Python Mathics,MacroPy,patterns,PyPatt. Therefore, we created the open source module MatchPy which offers similar pattern matching functionality in Python using a novel algorithm which finds matches for large pattern sets more efficiently by exploiting similarities between patterns.**HPTT: A High-Performance Tensor Transposition C++ Library**Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY, ACM, June 2017.@inproceedings{Springer2017:558, author = "Paul Springer and Tong Su and Paolo Bientinesi", title = "HPTT: A High-Performance Tensor Transposition C++ Library", booktitle = "Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming", year = 2017, series = "ARRAY", month = jun, publisher = "ACM", url = "https://arxiv.org/pdf/1704.04374.pdf" }

abstractwebPDFbibtexRecently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multi-threading, and explicit vectorization; furthermore it decomposes any transposition into multiple loops around a so called micro-kernel. This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4x4 transpose). HPTT also offers an optional autotuning framework—guided by a performance model—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen’s tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1x.**LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi**High Performance Computing. ISC 2017., Volume 10266, pp. 61-78, Springer, June 2017.

Lecture Notes in Computer Science.@inproceedings{McDoniel2017:890, author = "William McDoniel and Markus Höhnerbach and Rodrigo Canales and {Ahmed E.} Ismail and Paolo Bientinesi", title = "LAMMPS' PPPM Long-Range Solver for the Second Generation Xeon Phi", booktitle = "High Performance Computing. ISC 2017.", year = 2017, volume = 10266, pages = "61--78", address = "Cham", month = jun, publisher = "Springer", note = "Lecture Notes in Computer Science", url = "https://arxiv.org/pdf/1702.04250.pdf" }

abstractwebPDFbibtexMolecular Dynamics is an important tool for computational biologists, chemists, and materials scientists, consuming a sizable amount of supercomputing resources. Many of the investigated systems contain charged particles, which can only be simulated accurately using a long-range solver, such as PPPM. We extend the popular LAMMPS molecular dynamics code with an implementation of PPPM particularly suitable for the second generation Intel Xeon Phi. Our main target is the optimization of computational kernels by means of vectorization, and we observe speedups in these kernels of up to 12x. These improvements carry over to LAMMPS users, with overall speedups ranging between 2-3x, without requiring users to retune input parameters. Furthermore, our optimizations make it easier for users to determine optimal input parameters for attaining top performance.**TTC: A Tensor Transposition Compiler for Multiple Architectures**Proceedings of the 3rd International Workshop on Libraries, Languages and Compilers for Programming (ARRAY 2016), June 2016.@inproceedings{Springer2016:940, author = "Paul Springer and Aravind Sankaran and Paolo Bientinesi", title = "TTC: A Tensor Transposition Compiler for Multiple Architectures", year = 2016, month = jun, url = "https://arxiv.org/pdf/1607.01249.pdf" }

abstractPDFbibtexWe consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g.,~Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architecture, TTC yields speedups of up to 8x and 32x, respectively. We also showcase TTC's support for multiple leading dimensions, making it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.**The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability**Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'16, Number 7, pp. 7:1-7:13, IEEE Press, 2016.

Selected for Reproducibility Initiative at SC17.@inproceedings{Höhnerbach2016:78, author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi", title = "The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability", booktitle = "Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis", year = 2016, number = 7, series = "SC'16", pages = "7:1--7:13", publisher = "IEEE Press", note = "Selected for Reproducibility Initiative at SC17", url = "https://arxiv.org/pdf/1607.02904v1" }

abstractwebPDFbibtexMolecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multibody potentials deliver increased simulation accuracy but are too complex for effective compiler optimization. Because of this, achieving cross-platform performance remains an open question. By abstracting from target architecture and computing precision, we develop a vectorization scheme applicable to both CPUs and accelerators. We present results for the Tersoff potential within the molecular dynamics code LAMMPS on several architectures, demonstrating efficiency gains not only for computational kernels, but also for large-scale simulations. On a cluster of Intel Xeon Phi's, our optimized solver is between 3 and 5 times faster than the pure MPI reference.**The Tersoff many-body potential: Sustainable performance through vectorization**Proceedings of the SC15 Workshop: Producing High Performance and Sustainable Software for Molecular Simulation, November 2015.@inproceedings{Höhnerbach2015:718, author = "Markus Höhnerbach and {Ahmed E.} Ismail and Paolo Bientinesi", title = "The Tersoff many-body potential: Sustainable performance through vectorization", year = 2015, month = nov, url = "http://hpac.cs.umu.se/ipcc/sc15_paper.pdf" }

abstractPDFbibtexThis contribution discusses the effectiveness and the sustainability of vectorization in molecular dynamics. As a case study, we present results for multi-body potentials on a range of vector instruction sets, targeting both CPUs and accelerators such as GPUs and the Intel Xeon Phi. The shared-memory and distributed-memory parallelization of MD simulations are problems that have been extensively treated; by contrast, vectorization is a relatively unexplored dimension. However, given both the increasing number of vector units available in the computing architectures, and the increasing width of such units, it is imperative to write algorithms and code for taking advantage of vector instructions. We chose to investigate multi-body potentials, as they are not as easily vectorizable as the pair potentials. As a matter of fact, their optimization pushes the boundaries of current compiler technology: our experience suggests that for such problems, compilers alone are not able to deliver appreciable speedups. By constrast, by resorting to the explicit use of instrinsics, we demonstrate significant improvements both on CPUs and the Intel Xeon Phi. Nonetheless, it is clear that the direct use of intrinsics is not a sustainable solution, not only for the effort and expertise it requires, but also because it hinders readability, extensibility and maintainability of code. In our case study, we wrote a C++ implementation of the Tersoff potential for the LAMMPS molecular dynamics simulation package. To alleviate the difficulty of using intrinsics, we used a template-based approach and abstracted from the instruction set, the vector length and the floating point precision. Instead of having to write 18 different kernels, this approach allowed us to only have one source code, from which all implementations (scalar, SSE, AVX, AVX2, AVX512, and IMCI, in single, double, and mixed precision) are automatically derived. Such an infrastructure makes it possible to compare kernels and identify which algorithms are suitable for which architecture. Without optimizations, we observed that a Intel Xeon E5-2650 CPU (Sandy Bridge) is considerably faster than a Xeon Phi. With our optimizations, the performance of both CPU and accelerator improved, and thanks to its wide vector units, the Xeon Phi takes the lead. For the future, besides monitoring the compilers' improvements, one should evaluate OpenMP 4.1 and its successors, which will include support for directive-based vectorization.**A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics**High Performance Computing: 30th International Conference, ISC High Performance 2015, Lecture Notes in Computer Science, Volume 9137, pp. 155-170, Springer International Publishing, July 2015.@inproceedings{Springer2015:450, author = "Paul Springer and {Ahmed E.} Ismail and Paolo Bientinesi", title = "A Scalable, Linear-Time Dynamic Cutoff Algorithm for Molecular Dynamics", booktitle = "High Performance Computing: 30th International Conference, ISC High Performance 2015", year = 2015, volume = 9137, series = "Lecture Notes in Computer Science", pages = "155-170", month = jul, publisher = "Springer International Publishing", url = "https://arxiv.org/pdf/1701.05242.pdf" }

abstractwebPDFbibtexRecent results on supercomputers show that beyond 65K cores, the efficiency of molecular dynamics simulations of interfacial sys- tems decreases significantly. In this paper, we introduce a dynamic cutoff method (DCM) for interfacial systems of arbitrarily large size. The idea consists in adopting a cutoff-based method in which the cutoff is cho- sen on a particle-by-particle basis, according to the distance from the interface. Computationally, the challenge is shifted from the long-range solvers to the detection of the interfaces and to the computation of the particle-interface distances. For these tasks, we present linear-time algo- rithms that do not rely on global communication patterns. As a result, the DCM algorithm is suited for large systems of particles and mas- sively parallel computers. To demonstrate its potential, we integrated DCM into the LAMMPS open-source molecular dynamics package, and simulated large liquid/vapor systems on two supercomputers: SuperMuc and JUQUEEN. In all cases, the accuracy of DCM is comparable to the traditional particle-particle particle-mesh (PPPM) algorithm, while the performance is considerably superior for large numbers of particles. For JUQUEEN, we provide timings for simulations running on the full system (458, 752 cores), and show nearly perfect strong and weak scaling.**On the Performance Prediction of BLAS-based Tensor Contractions**High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.@inproceedings{Peise2015:380, author = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi", title = "On the Performance Prediction of BLAS-based Tensor Contractions", booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation", year = 2015, editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.", volume = 8966, series = "Lecture Notes in Computer Science", pages = "193-212", month = apr, publisher = "Springer International Publishing", url = "http://arxiv.org/pdf/1409.8608v1" }

abstractwebPDFbibtexTensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.**A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels**High Performance Computing for Computational Science -- VECPAR 2014, Lecture Notes in Computer Science, Volume 8969, pp. 245-258, Springer International Publishing, April 2015.@inproceedings{Peise2015:250, author = "Elmar Peise and Paolo Bientinesi", title = "A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels", booktitle = "High Performance Computing for Computational Science -- VECPAR 2014", year = 2015, editor = "Daydé, Michel and Marques, Osni and Nakajima, Kengo", volume = 8969, series = "Lecture Notes in Computer Science", pages = "245-258", month = apr, publisher = "Springer International Publishing", url = "https://arxiv.org/pdf/1402.5897v1" }

abstractwebPDFbibtexIt is universally known that caching is critical to attain high- performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.**Scalable and Efficient Linear Algebra Kernel Mapping for Low Energy Consumption on the Layers CGRA**Applied Reconfigurable Computing, Lecture Notes in Computer Science, Volume 9040, pp. 301-310, January 2015.@inproceedings{Rakossy2015:898, author = "{Zoltan Endre } Rakossy and Dominik Stengele and {Axel } Acosta-Aponte and Saumitra Chafekar and Paolo Bientinesi and {Anupam } Chattopadhyay", title = "Scalable and Efficient Linear Algebra Kernel Mapping for Low Energy Consumption on the Layers CGRA", booktitle = "Applied Reconfigurable Computing", year = 2015, volume = 9040, series = "Lecture Notes in Computer Science", pages = "301-310", month = jan, organization = "Ruhr-Universität Bochum", url = "http://arc2015.esit.rub.de" }

abstractwebPDFbibtexA scalable mapping is proposed for 3 important kernels from the Numerical Linear Algebra domain, to exploit architectural features to reach asymptotically optimal efficiency and a low energy consumption. Performance and power evaluations were done with input data set matrix sizes ranging from 64×64 to 16384×16384. 12 architectural variants with up to 10×10 processing elements were used to explore scalability of the mapping and the architecture, achieving < 10% energy increase for architectures up to 8×8 PEs coupled with performance speed-ups of more than an order of magnitude. This enables a clean area-performance trade-off on the Layers architecture while keeping energy constant over the variants.**Streaming Data from HDD to GPUs for Sustained Peak Performance**Proceedings of the Euro-Par 2013, 19th International European Conference on Parallel and Distributed Computing, Lecture Notes in Computer Science, Volume 8097, pp. 788-799, Springer Berlin Heidelberg, May 2013.@inproceedings{Beyer2013:618, author = "Lucas Beyer and Paolo Bientinesi", title = "Streaming Data from HDD to GPUs for Sustained Peak Performance", year = 2013, volume = 8097, series = "Lecture Notes in Computer Science", pages = "788-799", month = may, publisher = "Springer Berlin Heidelberg", url = "http://arxiv.org/pdf/1302.4332v1.pdf" }

abstractwebPDFbibtexIn the context of the genome-wide association studies (GWAS), one has to solve long sequences of generalized least-squares problems; such a task has two limiting factors: execution time --often in the range of days or weeks-- and data management --data sets in the order of Terabytes. We present an algorithm that obviates both issues. By pipelining the computation, and thanks to a sophisticated transfer strategy, we stream data from hard disk to main memory to GPUs and achieve sustained peak performance; with respect to a highly-optimized CPU implementation, our algorithm shows a speedup of 2.6x. Moreover, the approach lends itself to multiple GPUs and attains almost perfect scalability. When using 4 GPUs, we observe speedups of 9x over the aforementioned implementation, and 488x over a widespread biology library.**Algorithms for Large-scale Whole Genome Association Analysis**Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, pp. 229-234, ACM, 2013.@inproceedings{Peise2013:504, author = "Elmar Peise and Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Algorithms for Large-scale Whole Genome Association Analysis ", booktitle = "Proceedings of the 20th European MPI Users' Group Meeting", year = 2013, series = "EuroMPI '13", pages = "229--234 ", address = "New York, NY, USA", publisher = "ACM", url = "http://arxiv.org/pdf/1304.2272v1" }

abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of Gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double buffering technique, while the covariance matrix is kept across the main memory of a distributed memory system. We show that these methods sustain high-performance and allow the analysis of enormous datasets.**A Domain-Specific Compiler for Linear Algebra Operations**High Performance Computing for Computational Science - VECPAR 2012, Lecture Notes in Computer Science, Volume 7851, pp. 346-361, Springer, 2013.@inproceedings{Fabregat-Traver2013:340, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "A Domain-Specific Compiler for Linear Algebra Operations", booktitle = "High Performance Computing for Computational Science - VECPAR 2012", year = 2013, editor = "M. Dayde, O. Marques, and K. Nakajima", volume = 7851, series = "Lecture Notes in Computer Science", pages = "346--361", address = "Heidelberg", publisher = "Springer", url = "http://arxiv.org/pdf/1205.5975v1" }

abstractwebPDFbibtexWe present a prototypical linear algebra compiler that automatically exploits domain-specific knowledge to generate high-performance algorithms. The input to the compiler is a target equation together with knowledge of both the structure of the problem and the properties of the operands. The output is a variety of high-performance algorithms to solve the target equation. Our approach consists in the decomposition of the input equation into a sequence of library-supported kernels. Since in general such a decomposition is not unique, our compiler returns not one but a number of algorithms. The potential of the compiler is shown by means of its application to a challenging equation arising within the*genome-wide association study*. As a result, the compiler produces multiple "best" algorithms that outperform the best existing libraries.**Performance Modeling for Dense Linear Algebra**Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12), SCC '12, pp. 406-416, IEEE Computer Society, November 2012.@inproceedings{Peise2012:50, author = "Elmar Peise and Paolo Bientinesi", title = "Performance Modeling for Dense Linear Algebra", booktitle = "Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (PMBS12)", year = 2012, series = "SCC '12", pages = "406--416", address = "Washington, DC, USA", month = nov, publisher = "IEEE Computer Society", url = "http://arxiv.org/pdf/1209.2364" }

abstractwebPDFbibtexIt is well known that the behavior of dense linear algebra algorithms is greatly influenced by factors like target architecture, underlying libraries and even problem size; because of this, the accurate prediction of their performance is a real challenge. In this article, we are not interested in creating accurate models for a given algorithm, but in correctly ranking a set of equivalent algorithms according to their performance. Aware of the hierarchical structure of dense linear algebra routines, we approach the problem by developing a framework for the automatic generation of statistical performance models for BLAS and LAPACK libraries. This allows us to obtain predictions through evaluating and combining such models. We demonstrate that our approach is successful in both single- and multi-core environments, not only in the ranking of algorithms but also in tuning their parameters.**An Example of Symmetry Exploitation for Energy-related Eigencomputations**INTERNATIONAL CONFERENCE OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING 2009: (ICCMSE 2009), AIP Conference Proceedings, Volume 1504, pp. 1134-1137, 2012.@inproceedings{Petschow2012:860, author = "Matthias Petschow and Edoardo {Di Napoli} and Paolo Bientinesi", title = "An Example of Symmetry Exploitation for Energy-related Eigencomputations", booktitle = "INTERNATIONAL CONFERENCE OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING 2009: (ICCMSE 2009)", year = 2012, volume = 1504, series = "AIP Conference Proceedings", pages = "1134--1137", url = "http://link.aip.org/link/doi/10.1063/1.4772126" }

abstractwebPDFbibtexOne of the most used approaches in simulating materials is the tight-binding approximation. When using this method in a material simulation, it is necessary to compute the eigenvalues and eigenvectors of the Hamiltonian describing the system. In general, the system possesses few explicit symmetries. Due to them, the problem has many degenerate eigenvalues. The ambiguity in choosing a orthonormal basis of the invariant subspaces, associated with degenerate eigenvalues, will result in eigenvectors which are not invariant under the action of the symmetry operators in matrix form. A meaningful computation of the eigenvectors needs to take those symmetries into account. A natural choice is a set of eigenvectors, which simultaneously diagonalizes the Hamiltonian and the symmetry matrices. This is possible because all the matrices commute with each other. The simultaneous eigenvectors and the corresponding eigenvalues will be in a parametrized form in terms of the lattice momentum components. This functional dependence of the eigenvalues is the dispersion relation and describes the band structure of a material. Therefore it is important to find this functional dependence in any numerical computation related to material properties.**Execution-Less Performance Modeling**Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11) held as part of the Supercomputing Conference (SC11), November 2011.webbibtex@inproceedings{Iakymchuk2011:258, author = "Roman Iakymchuk and Paolo Bientinesi", title = "Execution-Less Performance Modeling", booktitle = "Proceedings of the Second International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computing Systems (PMBS11) held as part of the Supercomputing Conference (SC11)", year = 2011, address = "Seattle, USA", month = nov, institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen University" }

**High-Performance Parallel Computations using Python as High-Level Language**Euro-Par 2010, Parallel Processing Workshops, Lecture Notes in Computer Science, Volume 6586, pp. 541-548, Springer, 2011.@inproceedings{Masini2011:8, author = "Stefano Masini and Paolo Bientinesi", title = "High-Performance Parallel Computations using Python as High-Level Language", booktitle = "Euro-Par 2010, Parallel Processing Workshops", year = 2011, editor = "Guarracino, M.R.; Vivien, F.; Traff, J.L.; Cannataro, M.; Danelutto, M.; Hast, A.; Perla, F.; KnÃ�ï¿½Ã¯Â¿Â½Ã�ï¿½Ã�Â¼pfer, A.; Di Martino, B.; Alexander, M. ", volume = 6586, series = "Lecture Notes in Computer Science", pages = "541--548", publisher = "Springer", url = "http://hpac.cs.umu.se/~pauldj/pubs/Python-TR.pdf" }

abstractwebPDFbibtexHigh-performance and parallel computations have always represented a challenge in terms of code optimization and memory usage, and have typically been tackled with languages that allow a low-level management of resources, like Fortran, C and C++. Nowadays, most of the implementation effort goes into constructing the bookkeeping logic that binds together functionalities taken from standard libraries. Because of the increasing complexity of this kind of codes, it becomes more and more necessary to keep it well organized through proper software engineering practices. Indeed, in the presence of chaotic implementations, reasoning about correctness is difficult, even when limited to specific aspects like concurrency; moreover, due to the lack in flexibility of the code, making substantial changes for experimentation becomes a grand challenge. Since the bookkeeping logic only accounts for a tiny fraction of the total execution time, we believe that for such a task it can be afforded to introduce an overhead due to a high-level language. We consider Python as a preliminary candidate with the intent of improving code readability, flexibility and, in turn, the level of confidence with respect to correctness. In this study, the bookkeeping logic of SMP-MRRR, a C & Fortran highly optimized multi-core eigensolver, is ported to Python. We report here on the porting process and on the pros and cons of using a high-level language in a high-performance parallel library.**Improving High-Performance Computations on Clouds Through Resource Underutilization**Proceedings of ACM 26th Symposium on Applied Computing, pp. 119-126, ACM, 2011.@inproceedings{Iakymchuk2011:404, author = "Roman Iakymchuk and Jeff Napper and Paolo Bientinesi", title = "Improving High-Performance Computations on Clouds Through Resource Underutilization", booktitle = "Proceedings of ACM 26th Symposium on Applied Computing", year = 2011, pages = "119--126 ", address = "Taichung, Taiwan", publisher = "ACM" }

abstractwebbibtexWe investigate the effects of shared resources for high-performance computing in a commercial cloud environment where multiple virtual machines share a single hardware node. Although good performance is occasionally obtained, contention degrades the expected performance and introduces significant variance. Using the DGEMM kernel and the HPL benchmark, we show that the underutilization of resources considerably improves expected performance by reducing contention for the CPU and cache space. For instance, for some cluster configurations, the solution is reached almost an order of magnitude earlier on average when the available resources are underutilized. The performance benefits for single node computations are even more impressive: Underutilization improves the expected execution time by two orders of magnitude. Finally, in contrast to unshared clusters, extending underutilized clusters by adding more nodes often improves the execution time due to an increased parallelism even with a slow interconnect. In the best case, by underutilizing the nodes performance was improved enough to entirely offset the cost of an extra node in the cluster.**Automatic Generation of Loop-Invariants for Matrix Operations**Computational Science and its Applications, International Conference, pp. 82-92, IEEE Computer Society, 2011.@inproceedings{Fabregat-Traver2011:238, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Automatic Generation of Loop-Invariants for Matrix Operations", booktitle = "Computational Science and its Applications, International Conference", year = 2011, pages = "82-92", address = "Los Alamitos, CA, USA", publisher = "IEEE Computer Society", url = "http://hpac.cs.umu.se/aices/preprint/documents/AICES-2011-02-01.pdf" }

abstractwebPDFbibtexIn recent years it has been shown that for many linear algebra operations it is possible to create families of algorithms following a very systematic procedure. We do not refer to the fine tuning of a known algorithm, but to a methodology for the actual generation of both algorithms and routines to solve a given target matrix equation. Although systematic, the methodology relies on complex algebraic manipulations and non-obvious pattern matching, making the procedure challenging to be performed by hand, our goal is the development of a fully automated system that from the sole description of a target equation creates multiple algorithms and routines. We present CL1ck, a symbolic system written in Mathematica, that starts with an equation, decomposes it into multiple equations, and returns a set of loop-invariants for the algorithms -- yet to be generated -- that will solve the equation. In a successive step each loop-invariant is then mapped to its corresponding algorithm and routine. For a large class of equations, the methodology generates known algorithms as well as many previously unknown ones. Most interestingly, the methodology unifies algorithms traditionally developed in isolation. As an example, the five well known algorithms for the LU factorization are for the first time unified under a common root.**Knowledge-Based Automatic Generation of Partitioned Matrix Expressions**Computer Algebra in Scientific Computing, Lecture Notes in Computer Science, Volume 6885, pp. 144-157, Springer, 2011.@inproceedings{Fabregat-Traver2011:54, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Knowledge-Based Automatic Generation of Partitioned Matrix Expressions", booktitle = "Computer Algebra in Scientific Computing", year = 2011, editor = "Gerdt, Vladimir and Koepf, Wolfram and Mayr, Ernst and Vorozhtsov, Evgenii", volume = 6885, series = "Lecture Notes in Computer Science", pages = "144-157", address = "Heidelberg", publisher = "Springer", url = "http://hpac.cs.umu.se/aices/preprint/documents/AICES-2011-01-03.pdf" }

abstractwebPDFbibtexIn a series of papers it has been shown that for many linear algebra operations it is possible to generate families of algorithms by following a systematic procedure. Although powerful, such a methodology involves complex algebraic manipulation, symbolic computations and pattern matching, making the generation a process challenging to be performed by hand. We aim for a fully automated system that from the sole description of a target operation creates multiple algorithms without any human intervention. Our approach consists of three main stages. The first stage yields the core object for the entire process, the Partitioned Matrix Expression (PME), which establishes how the target problem may be decomposed in terms of simpler sub-problems. In the second stage the PME is inspected to identify predicates, the Loop-Invariants, to be used to set up the skeleton of a family of proofs of correctness. In the third and last stage the actual algorithms are constructed so that each of them satisfies its corresponding proof of correctness. In this paper we focus on the first stage of the process, the automatic generation of Partitioned Matrix Expressions. In particular, we discuss the steps leading to a PME and the knowledge necessary for a symbolic system to perform such steps. We also introduce Cl1ck, a prototype system written in Mathematica that generates PMEs automatically.**Matrix Structure Exploitation in Generalized Eigenproblems Arising in Density Functional Theory**ICNAAM 2010: International Conference of Numerical Analysis and Applied Mathematics 2010, AIP Conference Proceedings, Volume 1281, pp. 937-940, American Institute of Physics, 2010.@inproceedings{Di_Napoli2010:830, author = "Edoardo {Di Napoli} and Paolo Bientinesi", title = "Matrix Structure Exploitation in Generalized Eigenproblems Arising in Density Functional Theory", booktitle = "ICNAAM 2010: International Conference of Numerical Analysis and Applied Mathematics 2010", year = 2010, volume = 1281, series = "AIP Conference Proceedings", pages = "937--940", publisher = "American Institute of Physics", url = "http://hpac.cs.umu.se/~pauldj/pubs/ICNAAM-Edo-TR.pdf" }

abstractwebPDFbibtexIn this short paper, the authors report a new computational approach in the context of Density Functional Theory (DFT). It is shown how it is possible to speed up the self-consistent cycle (iteration) characterizing one of the most well-known DFT implementations: FLAPW. Generating the Hamiltonian and overlap matrices and solving the associated generalized eigenproblems A x = l B x constitute the two most time-consuming fractions of each iteration. Two promising directions, implementing the new methodology, are presented that will ultimately improve the performance of the generalized eigensolver and save computational time.**Benchmarking Different Direct Solution Methods For Large Power System Simulation**Grand Challenges in Modeling & Simulation, SummerSIM'10, pp. 280-284, 2010.bibtex@inproceedings{Benigni2010:654, author = "Andrea Benigni and Paolo Bientinesi and Antonello Monti", title = "Benchmarking Different Direct Solution Methods For Large Power System Simulation", booktitle = "Grand Challenges in Modeling & Simulation, SummerSIM'10", year = 2010, pages = "280--284 " }

**Towards Mechanical Derivation of Krylov Solver Libraries**Proceedings of the ICCS2010, Procedia Computer Science, Volume 1(1), pp. 1805-1813, 2010.@inproceedings{Eijkhout2010:480, author = "Victor Eijkhout and Paolo Bientinesi and Robert {van de Geijn}", title = "Towards Mechanical Derivation of Krylov Solver Libraries", year = 2010, volume = 1, number = 1, series = "Procedia Computer Science", pages = "1805--1813", url = "http://hpac.cs.umu.se/~pauldj/pubs/Krylov-TR.pdf" }

abstractwebPDFbibtexIn a series of papers, it has been shown that algorithms for dense linear algebra operations can be systematically and even mechanically derived from the mathematical specification of the operation. A frequent question has been whether the methodology can be broadened to iterative methods. In this paper, we show preliminary evidence that this is indeed the case for so-called Krylov subspace methods. Our aims with this are two-fold: first of all, we show how the FLAME paradigm can simplify the derivation of subspace iteration methods. In view of this, the fact that we only derive the classical conjugate gradient method should be viewed as a promise for the future, rather than as a limitation of this approach. Secondly, and more importantly, our use of FLAME shows how mechanical reasoning can derive full algorithm specifications from the constraints (for instance, orthogonality conditions) on the generated results. If we tie this to ongoing research in automatic optimized code generated in FLAME, we see that this research is a necessary building block towards automatic generation of optimized library software.**On Parallelizing the MRRR Algorithm for Data-Parallel Coprocessors**Proceedings of the PPAM 2009, Eighth International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, Volume 6067, Springer, 2010.@inproceedings{Lessig2010:138, author = "Christian Lessig and Paolo Bientinesi", title = "On Parallelizing the MRRR Algorithm for Data-Parallel Coprocessors", year = 2010, volume = 6067, series = "Lecture Notes in Computer Science", publisher = "Springer", url = "http://hpac.cs.umu.se/~pauldj/pubs/MRRR-GPU.pdf" }

abstractwebPDFbibtexThe eigenvalues and eigenvectors of a symmetric matrix are needed in a myriad of applications in computational engineering and computational science. One of the fastest and most accurate eigensolvers is the Algorithm of Multiple Relatively Robust Representations (MRRR). This is the first stable algorithm that computes k eigenvalues and eigenvectors of a tridiagonal symmetric matrix in O(nk) time. We present a parallelization of the MRRR algorithm for data parallel coprocessors using the CUDA programming environment. The results clearly demonstrate the potential of data-parallel coprocessors for scientific computations: when comparing against routine sstemr, LAPACKâ��s implementation of MRRR, our parallel algorithm provides 10-fold speedups.**Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures**Proceedings of the PPAM 2009, Eighth International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, Volume 6067, Springer, 2010.@inproceedings{Bientinesi2010:970, author = "Paolo Bientinesi and Francisco Igual and Daniel Kressner and {Enrique S.} Quintana-Orti", title = "Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures", year = 2010, volume = 6067, series = "Lecture Notes in Computer Science", publisher = "Springer" }

abstractwebbibtexWe investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on general-purpose multi-core processors. In response to the advances of hardware accelerators, we also modify the code in SBR to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). Performance results illustrate the parallelism and scalability of these algorithms on current high-performance multi-core architectures.**The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors**Proceedings of the PARA 2010, Reykjavik, Iceland, June 6-9, 2010, Lecture Notes in Computer Science, Volume 7133, pp. 152-161, Springer, 2010.@inproceedings{Petschow2010:754, author = "Matthias Petschow and Paolo Bientinesi", title = "The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors", year = 2010, editor = "Jonasson, Kristjan", volume = 7133, series = "Lecture Notes in Computer Science", pages = "152--161", publisher = "Springer", url = "http://hpac.cs.umu.se/aices/preprint/documents/AICES-2010-09-04.pdf" }

abstractwebPDFbibtexThe algorithm of Multiple Relatively Robust Representations (MRRR or MR3) computes k eigenvalues and eigenvectors of a symmetric tridiagonal matrix in O(nk) arithmetic operations. Large problems can be effectively tackled with existing distributed-memory parallel implementations of MRRR; small and medium size problems can instead make use of LAPACK's routine xSTEMR. However, xSTEMR is optimized for single-core CPUs, and does not take advantage of today's multi-core and future many-core architectures. In this paper we discuss some of the issues and trade-offs arising in the design of MR3-SMP, an algorithm for multi-core CPUs and SMP systems. Experiments on application matrices indicate that MR3-SMP is both faster and obtains better speedups than all the tridiagonal eigensolvers included in LAPACK and Intel's Math Kernel Library (MKL).**Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.**Proceedings of the 8th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2010), AIP Conference Proceedings, Volume 1281, pp. 774-777, 2010.@inproceedings{Fabregat-Traver2010:304, author = "Diego Fabregat-Traver and Paolo Bientinesi", title = "Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.", year = 2010, volume = 1281, series = "AIP Conference Proceedings", pages = "774-777", url = "http://hpac.cs.umu.se/aices/preprint/documents/AICES-2010-10-01.pdf" }

abstractwebPDFbibtexWe target the automatic generation of formally correct algorithms and routines for linear algebra operations. Given the broad variety of architectures and configurations with which scientists deal, there does not exist one algorithmic variant that is suitable for all scenarios. Therefore, we aim to generate a family of algorithmic variants to attain high-performance for a broad set of scenarios. One of the authors has previously demonstrated that automatic derivation of a family of algorithms is possible when the Partitioned Matrix Expression (PME) of the target operation is available. The PME is a recursive definition that states the relations between submatrices in the input and the output operands. In this paper we describe all the steps involved in the automatic derivation of PMEs, thus making progress towards a fully automated system.**Can Cloud Computing Reach the TOP500?**UCHPC-MAW '09 Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, pp. 17-20, 2009.

In conjunction with the 2009 ACM International Conference on Computing Frontiers.@inproceedings{Napper2009:804, author = "Jeff Napper and Paolo Bientinesi", title = "Can Cloud Computing Reach the TOP500?", booktitle = "UCHPC-MAW '09 Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop ", year = 2009, pages = "17--20", note = "In conjunction with the 2009 ACM International Conference on Computing Frontiers", url = "http://hpac.cs.umu.se/~pauldj/pubs/EC2-Cloud.pdf" }

abstractwebPDFbibtexComputing as a utility has reached the mainstream. Scientists can now rent time on large commercial clusters through several vendors. The cloud computing model provides flexible support for "pay as you go" systems. In addition to no upfront investment in large clusters or supercomputers, such systems incur no maintenance costs. Furthermore, they can be expanded and reduced on-demand in real-time. Current cloud computing performance falls short of systems specifically designed for scientific applications. Scientific computing needs are quite different from those of web applications--composed primarily of database queries--that have been the focus of cloud computing vendors. In this paper we investigate the use of cloud computing for high-performance numerical applications. In particular, we assume unlimited monetary resources to answer the question, "How high can a cloud computing service get in the TOP500 list?" We show results for the Linpack benchmark on different allocations on Amazon EC2.**Multi-dimensional Array Operations for Signal Processing Algorithms**PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, 2008.

To be published.@inproceedings{Bientinesi2008:640, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Multi-dimensional Array Operations for Signal Processing Algorithms", booktitle = "PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing", year = 2008, note = "To be published", url = "http://hpac.cs.umu.se/~pauldj/pubs/para-2008.pdf" }

abstractPDFbibtexGame and graphics processors are increasingly utilized for scientific computing applications and for signal processing in particular. This paper addresses the issue of efficiently mapping high-dimensional array operations for signal processing algorithms onto such computing platforms. Algorithms for fast Fourier transforms and convolutions are essential to many signal processing applications. Such algorithms entail fast high-dimensional data array operations. Game and graphics processors typically differ from general-purpose processors in memory hierarchy as well as in memory capacity and access patterns. We characterize the memory structures from the perspective of high-dimensional array operations, identify the mismatch between algorithmic dimension and architectural dimension and then describe the consequent penalty in memory latency. An architecture supports d-dimensional accesses if a d-dimensional data array can be accessed along one dimension as fast as along any other dimension. We introduce an approach to reduce latency and provide experimental results on the STI Cell Broadband Engine as supporting evidence.**Fast Computation of Local Correlation Coefficients**Proc. SPIE 7074, Advanced Signal Processing Algorithms, Architectures, and Implementations XVIII, Number 707405, 2008.@inproceedings{Bientinesi2008:478, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Fast Computation of Local Correlation Coefficients", year = 2008, number = 707405, series = "Proc. SPIE 7074, Advanced Signal Processing Algorithms, Architectures, and Implementations XVIII" }

abstractwebbibtexThis paper presents an acceleration method, using both algorithmic and architectural means, for fast calculation of local correlation coefficients, which is a basic image-based information processing step for template or pattern matching, image registration, motion or change detection and estimation, compensation of changes, or compression of representations, among other information processing objectives. For real-time applications, the complexity in arithmetic operations as well as in programming and memory access latency had been a divisive issue between the so-called correction-based methods and the Fourier domain methods. In the presented method, the complexity in calculating local correlation coefficients is reduced via equivalent reformulation that leads to efficient array operations or enables the use of multi-dimensional fast Fourier transforms, without losing or sacrificing local and non-linear changes or characteristics. The computation time is further reduced by utilizing modern multi-core architectures, such as the Sony-Toshiba-IBM Cell processor, with high processing speed and low power consumption.**SuperMatrix: a Multithreaded Runtime Scheduling System for Algorithms-by-Blocks**Proceedings of ACM SIGPLAN 2008 Symposium on Principles and Practice of Parallel Programming (PPoPP'08), 2008.@inproceedings{Bientinesi2008:400, author = "Paolo Bientinesi and Ernie Chan and {Enrique S.} Quintana-Orti and Gregorio Quintana-Orti and Robert {van de Geijn}", title = "SuperMatrix: a Multithreaded Runtime Scheduling System for Algorithms-by-Blocks", booktitle = "Proceedings of ACM SIGPLAN 2008 Symposium on Principles and Practice of Parallel Programming (PPoPP'08)", year = 2008 }

abstractwebbibtexThis paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.**Sparse Direct Factorizations through Unassembled Hyper-Matrices**Proceedings of the International Congress on Industrial and Applied Mathematics (ICIAM'07), Proc. Appl. Math. Mech., Volume 7(1), pp. 1010901-1010902, December 2007.webbibtex@inproceedings{Bientinesi2007:254, author = "Paolo Bientinesi and Victor Eijkhout and Jason Kurtz and Robert {van de Geijn}", title = "Sparse Direct Factorizations through Unassembled Hyper-Matrices", booktitle = "Proceedings of the International Congress on Industrial and Applied Mathematics (ICIAM'07)", year = 2007, volume = 7, number = 1, series = "Proc. Appl. Math. Mech.", pages = "1010901--1010902", month = dec }

**Formal Correctness and Stability of Dense Linear Algebra Algorithms**Proceedings of 17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation, 2005.@inproceedings{Bientinesi2005:110, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Formal Correctness and Stability of Dense Linear Algebra Algorithms", booktitle = "Proceedings of 17th IMACS World Congress: Scientific Computation, Applied Mathematics and Simulation", year = 2005, url = "http://hpac.cs.umu.se/~pauldj/pubs/IMACS.pdf" }

abstractPDFbibtexThe Formal Linear Algebra Methods Environment (FLAME) project at UT-Austin pursues the mechanical derivation of algorithms for linear algebra operations. Rather than proving loop based algorithms correct a posteriori, a systematic methodology is employed that determines loop invariants from a mathematical specification of a given linear algebra operation. Algorithms and their correctness proofs are then constructively derived from this loop invariant. The process has been made mechanical via a prototype system implemented with Mathematica. Once an algorithm has been determined, a similarly systematic a posteriori process is used to determine the correctness in the presence of roundoff error (stability properties) of the algorithm. In this paper, we report progress towards the ultimate goal of full automation of the system.**Automatic Derivation of Linear Algebra Algorithms with Application to Control Theory**PARA'04 State-of-the-Art in Scientific Computing, June 2004.@inproceedings{Bientinesi2004:690, author = "Paolo Bientinesi and Sergey Kolos and Robert {van de Geijn}", title = "Automatic Derivation of Linear Algebra Algorithms with Application to Control Theory", booktitle = "PARA'04 State-of-the-Art in Scientific Computing", year = 2004, month = jun, url = "http://hpac.cs.umu.se/~pauldj/pubs/PARA04-automsyst.pdf" }

abstractPDFbibtexIt is our belief that the ultimate automatic system for deriving linear algebra libraries should be able to generate a set of algorithms starting from the mathematical specification of the target operation only. Indeed, one should be able to visit a website, fill in a form with information about the operation to be performed and about the target architectures, click the SUBMIT button, and receive an optimized library routine for that operation even if the operation has not previously been implemented. In this paper we relate recent advances towards what is probably regarded as an unreachable dream. We discuss the steps necessary to automatically obtain an algorithm starting from the mathematical abstract description of the operation to be solved. We illustrate how these steps have been incorporated into two prototype systems and we show the application of one the two systems to a problem from Control Theory: The Sylvester Equation. The output of this system is a description of an algorithm that can then be directly translated into code via API’s that we have developed. The description can also be passed as input to a parallel performance analyzer to obtain optimized parallel routines.**Rapid Development of High-Performance Linear Algebra Libraries**PARA'04 State-of-the-Art in Scientific Computing, 2004.@inproceedings{Bientinesi2004:554, author = "Paolo Bientinesi and {John A.} Gunnels and {Fred G.} Gustavson and {Greg M.} Henry and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "Rapid Development of High-Performance Linear Algebra Libraries", booktitle = "PARA'04 State-of-the-Art in Scientific Computing", year = 2004, url = "http://hpac.cs.umu.se/~pauldj/pubs/PARA04-worksheet.pdf" }

abstractPDFbibtexWe present a systematic methodology for deriving and implementing linear algebra libraries. It is quite common that an application requires a library of routines for the computation of linear algebra operations that are not (exactly) supported by commonly used libraries like LAPACK. In this situation, the application developer has the option of casting the operation into one supported by an existing library, often at the expense of performance, or implementing a custom library, often requiring considerable effort. Our recent discovery of a methodology based on formal derivation of algorithm allows such a user to quickly derive proven correct algorithms. Furthermore it provides an API that allows the so-derived algorithms to be quickly translated into high-performance implementations.**The Science of Programming High-Performance Linear Algebra Libraries**Proceedings of the Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL-02), 2002.@inproceedings{Bientinesi2002:448, author = "Paolo Bientinesi and {John A.} Gunnels and {Fred G.} Gustavson and {Greg M.} Henry and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "The Science of Programming High-Performance Linear Algebra Libraries", booktitle = "Proceedings of the Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL-02)", year = 2002, editor = "G. Baumgartner, J. Ramanujam, and P. Sadayappan " }

abstractbibtexWhen considering the unmanageable complexity of computer systems, Dijkstra recently made the following observations: 1. When exhaustive testing is impossible --i.e., almost always-- our trust can only be based on proof (be it mechanized or not). 2. A program for which it is not clear why we should trust it, is of dubious value. 3. A program should be structured in such a way that the argument for its correctness is feasible and not unnecessarily laborious. 4. Given the proof, deriving a program justified by it, is much easier than, given the program, constructing a proof justifying it. Over the last decade, our research in the area of the implementation of high performance sequential and parallel linear algebra libraries has made the wisdom of the above observations obvious. In this paper, we show how to apply formal derivation methods to linear algebra operations. By combining this approach with a high-level abstraction for coding such algorithms, a linear algebra library that is more desirable, as measured by the metric provided by the above observations, can be achieved. Moreover, such a library can include a richer array of algorithms and is, at the same time, more easily maintained and extended. Most surprisingly, the methodology results in more efficient implementations. The approach is sufficiently systematic that much of it can be, and has been, automated.

### Theses

**Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms**Ph.D. Thesis, Department of Computer Sciences, University of Texas at Austin, July 2006.@phdthesis{Bientinesi2006:720, author = "Paolo Bientinesi", title = "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms", school = "Department of Computer Sciences, University of Texas at Austin", year = 2006, type = "Ph.D. Thesis", month = jul, url = "http://hpac.cs.umu.se/~pauldj/pubs/dissertation.pdf" }

abstractPDFbibtexWe consider the problem of developing formally correct dense linear algebra libraries. The problem would be solved convincingly if, starting from the mathe- matical specification of a target operation, it were possible to generate, implement and analyze a family of correct algorithms that compute the operation. This thesis presents evidence that for a class of dense linear operations, systematic and me- chanical development of algorithms is within reach. It describes and demonstrates an approach for deriving and implementing, systematically and even mechanically, proven correct algorithms. It also introduces a systematic procedure to analyze, in a modular fashion, numerical properties of the generated algorithms.**Computational Geometry Techniques for Approximation of Electrostatic Forces**Master's Thesis, University of Pisa, 1998.

### Others

**The Matrix Chain Algorithm to Compile Linear Algebra Expressions**Proceedings of the 4th Workshop on Domain Specific Language Design and Implementation (DSLDI), Extended abstract, 31 October 2016.**Program Generation for Linear Algebra Using Multiple Layers of DSLs**Proceedings of the DSLDI 2016, Extended abstract, October 2016.bibtex@misc{Spampinato2016:960, author = "{Daniele } Spampinato and Diego Fabregat-Traver and Markus Pueschel and Paolo Bientinesi", title = "Program Generation for Linear Algebra Using Multiple Layers of DSLs", month = oct, year = 2016, type = "Extended abstract" }

**HPC Sharing in the Cloud**August 2010.@misc{Napper2010:958, author = "Jeff Napper and Roman Iakymchuk and Paolo Bientinesi", title = "HPC Sharing in the Cloud", howpublished = "HPCWire.com", month = aug, year = 2010 }

abstractwebbibtexMany scientific applications can require significant coordination between large numbers of nodes. Massive effort is spent in the HPC arena to hide much of the latency from coordination with expensive low-latency networks and fine-tuned communication libraries. Such efforts have not yet been translated to the commercial cloud computing arena, which still typically provide systems with varying amounts of installed memory but rarely various quality interconnects.

### Technical Reports

**The Landscape of Software for Tensor Computations**May 2021.@techreport{Psarras2021:198, author = "Christos Psarras and Lars Karlsson and Jiajia Li and Paolo Bientinesi", title = "The Landscape of Software for Tensor Computations", year = 2021, month = may, journal = "arXiv", url = "https://arxiv.org/pdf/2103.13756.pdf" }

abstractwebPDFbibtexTensors (also commonly seen as multi-linear operators or asmulti-dimensional arrays) are ubiquitous in scientific computing and in data science, and so are the software efforts for tensor operations. Particularly in recent years, we have observed an explosion in libraries, compilers, packages, and toolboxes; unfortunately these efforts are very much scattered among the different scientific domains, and inevitably suffer from replication, suboptimal implementations, and in many cases, limited visibility. As a first step towards countering these inefficiencies, here we survey and loosely classify software packages related to tensor computations. Our aim is to assemble a comprehensive and up-to-date snapshot of the tensor software landscape, with the intention of helping both users and developers.**Large Scale Parallel Computations in R through Elemental**October 2016.@techreport{Canales2016:718, author = "Rodrigo Canales and Elmar Peise and Paolo Bientinesi", title = "Large Scale Parallel Computations in R through Elemental", year = 2016, month = oct, url = "https://arxiv.org/pdf/1610.07310v1" }

abstractwebPDFbibtexEven though in recent years the scale of statistical analysis problems has increased tremendously, many statistical software tools are still limited to single-node computations. However, statistical analyses are largely based on dense linear algebra operations, which have been deeply studied, optimized and parallelized in the high-performance-computing community. To make high-performance distributed computations available for statistical analysis, and thus enable large scale statistical computations, we introduce RElem, an open source package that integrates the distributed dense linear algebra library Elemental into R. While on the one hand, RElem provides direct wrappers of Elemental's routines, on the other hand, it overloads various operators and functions to provide an entirely native R experience for distributed computations. We showcase how simple it is to port existing R programs to Relem and demonstrate that Relem indeed allows to scale beyond the single-node limitation of R with the full performance of Elemental without any overhead.**A Note on Time Measurements in LAMMPS**Aachen Institute for Computational Engineering Science, RWTH Aachen, February 2016.

Technical Report AICES-2016/02-1.@techreport{Tameling2016:140, author = "Daniel Tameling and Paolo Bientinesi and {Ahmed E.} Ismail", title = "A Note on Time Measurements in LAMMPS", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2016, month = feb, note = "Technical Report AICES-2016/02-1", url = "http://arxiv.org/abs/1602.05566" }

abstractPDFbibtexWe examine the issue of assessing the efficiency of components of a parallel program at the example of the MD package LAMMPS. In particular, we look at how LAMMPS deals with the issue and explain why the approach adopted might lead to inaccurate conclusions. The misleading nature of this approach is subsequently verified experimentally with a case study. Afterwards, we demonstrate how one should correctly determine the efficiency of the components and show what changes to the code base of LAMMPS are necessary in order to get the correct behavior.**Cache-aware Performance Modeling and Prediction for Dense Linear Algebra**November 2014.@techreport{Peise2014:904, author = "Elmar Peise and Paolo Bientinesi", title = "Cache-aware Performance Modeling and Prediction for Dense Linear Algebra", year = 2014, month = nov, url = "http://arxiv.org/pdf/1409.8602v1" }

abstractPDFbibtexCountless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation---without executing it---is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, normally one has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms within few percents from the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes.**FLAME Derivation of the General Minimal Residual Method**Texas Advanced Computing Center, The University of Texas at Austin, November 2013.@techreport{Bientinesi2013:760, author = "Paolo Bientinesi and Victor Eijkhout", title = "FLAME Derivation of the General Minimal Residual Method", institution = "Texas Advanced Computing Center, The University of Texas at Austin", year = 2013, month = nov, url = "http://hpac.cs.umu.se/~pauldj/pubs/gmres.pdf" }

abstractPDFbibtexThe Conjugate Gradient method is found by constructing a sequence of residuals R that spans the Krylov space K and that is orthogonal. The GMRES method constructs another basis of the Krylov space, but the residuals S now satisfy 1) the residual is of minimum L2 length in each iteration, or equivalent, 2) the residuals are A-orthogonal. While it would be easy to construct the A-orthogonal basis of K, and in fact several methods do so, this construction can break down. For this reason, Saad and Schultz [2] proposed using a breakdown-free process, such as Arnoldi iteration, and constructing the basis by appropriate linear combinations.**Streaming Data from HDD to GPUs for Sustained Peak Performance**Aachen Institute for Computational Engineering Science, RWTH Aachen, February 2013.

Technical Report AICES-2013/02-1.@techreport{Beyer2013:398, author = "Lucas Beyer and Paolo Bientinesi", title = "Streaming Data from HDD to GPUs for Sustained Peak Performance", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2013, month = feb, note = "Technical Report AICES-2013/02-1", url = "https://arxiv.org/pdf/1302.4332" }

abstractPDFbibtex**Improved Orthogonality for Dense Hermitian Eigensolvers based on the MRRR algorithm**Aachen Institute for Computational Engineering Science, RWTH Aachen, September 2012.

Technical Report AICES-2012/09-1.@techreport{Petschow2012:754, author = "Matthias Petschow and {Enrique S.} Quintana-Orti and Paolo Bientinesi", title = "Improved Orthogonality for Dense Hermitian Eigensolvers based on the MRRR algorithm", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2012, month = sep, note = "Technical Report AICES-2012/09-1", url = "http://hpac.cs.umu.se/~pauldj/pubs/MixedPrecision.pdf" }

abstractPDFbibtexThe dense Hermitian eigenproblem is of outstanding importance in numerical computations and a number of excellent algorithms for this problem exist. One of the fastest method is the MRRR algorithm, which is based on a reduction to real tridiagonal form. This approach, although fast, does not deliver the same accuracy (orthogonality) as competing methods like the Divide-and-Conquer or the QR algorithm. In this paper, we demonstrate how the use of mixed precisions in MRRR-based eigensolvers leads to improved orthogonality. At the same time, when compared to the classical use of the MRRR algorithm, our approach comes with no or only limited performance penalty, increases the robustness, and improves scalability.**High-Throughput Genome-Wide Association Analysis for Single and Multiple Phenotypes**Aachen Institute for Computational Engineering Science, RWTH Aachen, July 2012.

Technical Report AICES-2012/07-1.@techreport{Fabregat-Traver2012:20, author = "Diego Fabregat-Traver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "High-Throughput Genome-Wide Association Analysis for Single and Multiple Phenotypes", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2012, month = jul, note = "Technical Report AICES-2012/07-1", url = "http://arxiv.org/abs/1207.2169" }

abstractPDFbibtexThe variance component tests used in genomewide association studies of thousands of individuals become computationally exhaustive when multiple traits are analysed in the context of omics studies. We introduce two high-throughput algorithms -- CLAK-CHOL and CLAK-EIG -- for single and multiple phenotype genome-wide association studies (GWAS). The algorithms, generated with the help of an expert system, reduce the computational complexity to the point that thousands of traits can be analyzed for association with millions of polymorphisms in a course of days on a standard workstation. By taking advantage of problem specific knowledge, CLAK-CHOL and CLAK-EIG significantly outperform the current state-of-the-art tools in both single and multiple trait analysis.**Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures**Aachen Institute for Computational Engineering Science, RWTH Aachen, November 2011.

Technical Report AICES-2011/11-3.@techreport{Aliaga2011:348, author = "Jose' Aliaga and Paolo Bientinesi and Davor Davidovic and Edoardo {Di Napoli} and Francisco Igual and {Enrique S.} Quintana-Orti", title = "Solving Dense Generalized Eigenproblems on Multi-Threaded Architectures", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2011, month = nov, note = "Technical Report AICES-2011/11-3" }

abstractbibtex**Proof-Driven Derivation of Krylov Solver Libraries**Aachen Institute for Computational Engineering Science, RWTH Aachen, June 2010.

Technical Report AICES-2010/06-3.PDFbibtex@techreport{Eijkhout2010:0, author = "Victor Eijkhout and Paolo Bientinesi and Robert {van de Geijn}", title = "Proof-Driven Derivation of Krylov Solver Libraries ", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2010, month = jun, note = "Technical Report AICES-2010/06-3", url = "http://hpac.cs.umu.se/~pauldj/pubs/Krylov-TR.pdf" }

**The Science of Deriving Stability Analyses**Aachen Institute for Computational Engineering Science, RWTH Aachen, November 2008.

Technical Report AICES-2008-07.@techreport{Bientinesi2008:608, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "The Science of Deriving Stability Analyses", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2008, month = nov, note = "Technical Report AICES-2008-07", url = "http://hpac.cs.umu.se/~pauldj/pubs/Stability-TR-2008.pdf" }

abstractPDFbibtexWe introduce a methodology for obtaining inventories of error results for families of numerical dense linear algebra algorithms. The approach for deriving the analyses is goal-oriented, systematic, and layered. The presentation places the analysis side-by-side with the algorithm so that it is obvious where error is introduced, making it attractive for use in the classroom. Yet the approach is sufficiently powerful to support the analysis of more complex algorithms, such as blocked variants that attain high performance. We provide what we believe to be the first analysis of a practical blocked LU factorization. Contrary to popular belief, the bound on the error is tighter than that of the corresponding unblocked algorithm.**Automation in Dense Linear Algebra**Aachen Institute for Computational Engineering Science, RWTH Aachen, October 2008.

Technical Report AICES-2008-2.PDFbibtex@techreport{Bientinesi2008:748, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Automation in Dense Linear Algebra", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2008, month = oct, note = "Technical Report AICES-2008-2", url = "http://hpac.cs.umu.se/~pauldj/pubs/CC-2009.pdf" }

**Multi-dimensional Array Memory Accesses for FFTs on Parallel Architectures.**Department of Computer Science, Duke University, December 2007.

Technical Report CS-2007-10.bibtex@techreport{Bientinesi2007:250, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Multi-dimensional Array Memory Accesses for FFTs on Parallel Architectures. ", institution = "Department of Computer Science, Duke University", year = 2007, month = dec, note = "Technical Report CS-2007-10" }

**Sparse Direct Factorizations through Unassembled Hyper-Matrices**Texas Advanced Computing Center (TACC), The University of Texas at Austin, October 2007.

Technical Report TR-07-02.PDFbibtex@techreport{Bientinesi2007:178, author = "Paolo Bientinesi and Victor Eijkhout and Kyungjoo Kim and Jason Kurtz and Robert {van de Geijn}", title = "Sparse Direct Factorizations through Unassembled Hyper-Matrices", institution = "Texas Advanced Computing Center (TACC), The University of Texas at Austin", year = 2007, month = oct, note = "Technical Report TR-07-02", url = "http://hpac.cs.umu.se/~pauldj/pubs/uhmreport.pdf" }

**Parallel 2D FFTs on the Cell Broadband Engine**Department of Computer Science, Duke University, April 2007.

Technical Report CS-2007-03.bibtex@techreport{Bientinesi2007:4, author = "Paolo Bientinesi and Nikos Pitsianis and Xiaobai Sun", title = "Parallel 2D FFTs on the Cell Broadband Engine", institution = "Department of Computer Science, Duke University", year = 2007, month = apr, note = "Technical Report CS-2007-03" }

**Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms**The University of Texas at Austin, Department of Computer Sciences, July 2006.

Technical Report TR-06-46.@techreport{Bientinesi2006:4, author = "Paolo Bientinesi", title = "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2006, month = jul, note = "Technical Report TR-06-46", url = "http://hpac.cs.umu.se/~pauldj/pubs/dissertation.pdf" }

abstractPDFbibtex**Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix**The University of Texas at Austin, Department of Computer Sciences, April 2006.

FLAME Working Note #19. Technical Report TR-06-20.bibtex@techreport{Bientinesi2006:610, author = "Paolo Bientinesi and Brian Gunter and Robert {van de Geijn}", title = "Families of Algorithms Related to the Inversion of a Symmetric Positive Definite Matrix", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2006, month = apr, note = "FLAME Working Note #19. Technical Report TR-06-20" }

**Representing Dense Linear Algebra Algorithms: A Farewell to Indices**The University of Texas at Austin, Department of Computer Sciences, February 2006.

FLAME Working Note #17. Technical Report TR-06-10.@techreport{Bientinesi2006:304, author = "Paolo Bientinesi and Robert {van de Geijn}", title = "Representing Dense Linear Algebra Algorithms: A Farewell to Indices", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2006, month = feb, note = "FLAME Working Note #17. Technical Report TR-06-10", url = "http://hpac.cs.umu.se/~pauldj/pubs/farewell.pdf" }

abstractPDFbibtexWe present a notation that allows a dense linear algebra algorithm to be represented in a way that is visually recognizable. The primary value of the notation is that it exposes subvectors and submatrices allowing the details of the algorithm to be the focus while hiding the intricate indices related to the arrays in which the vectors and matrices are stored. The applicability of the notation is illustrated through a succession of progressively complex case studies ranging from matrix-vector operations to the chasing of the bulge of the symmetric QR iteration. The notation facilitates comparing and contrasting different algorithms for the same operation as well as similar algorithms for different operations. Finally, we point out how algorithms represented with this notation can be directly translated into high-performance code.**FLAME 2005 Prospectus: Towards the Final Generation of Dense Linear Algebra Libraries**The University of Texas at Austin, Department of Computer Sciences, April 2005.

FLAME Working Note #16. Technical Report TR-05-15.bibtex@techreport{Bientinesi2005:218, author = "Paolo Bientinesi and Kazushige Goto and Tze {Meng Low} and {Enrique S.} Quintana-Orti and Robert {van de Geijn} and Field {Van Zee}", title = "FLAME 2005 Prospectus: Towards the Final Generation of Dense Linear Algebra Libraries", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2005, month = apr, note = "FLAME Working Note #16. Technical Report TR-05-15" }

**FLAME@Lab: A Farewell to Indices**The University of Texas at Austin, Department of Computer Sciences, April 2003.

FLAME Working Note #11. Technical Report TR-03-11.PDFbibtex@techreport{Bientinesi2003:828, author = "Paolo Bientinesi and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "FLAME@Lab: A Farewell to Indices", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2003, month = apr, note = "FLAME Working Note #11. Technical Report TR-03-11", url = "http://hpac.cs.umu.se/~pauldj/pubs/FLAME@lab-tr03-11.pdf" }

**A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations**The University of Texas at Austin, Department of Computer Sciences, March 2003.

Technical Report TR-03-26.PDFbibtex@techreport{Bientinesi2003:54, author = "Paolo Bientinesi and Inderjit Dhillon and Robert {van de Geijn}", title = "A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2003, month = mar, note = "Technical Report TR-03-26", url = "http://hpac.cs.umu.se/~pauldj/pubs/MR3-tr03-26.pdf" }

**The Science of Deriving Dense Linear Algebra Algorithms**The University of Texas at Austin, Department of Computer Sciences, September 2002.

FLAME Working Note #8. Technical Report TR-02-53.PDFbibtex@techreport{Bientinesi2002:440, author = "Paolo Bientinesi and {John A.} Gunnels and {Margaret E.} Myers and {Enrique S.} Quintana-Orti and Robert {van de Geijn}", title = "The Science of Deriving Dense Linear Algebra Algorithms", institution = "The University of Texas at Austin, Department of Computer Sciences", year = 2002, month = sep, note = "FLAME Working Note #8. Technical Report TR-02-53", url = "http://hpac.cs.umu.se/~pauldj/pubs/FLAME-tr02-53.pdf" }

**Electrostatic Fields Without Singularities: Implementation and Experiments**Institute for Computational Mathematics, CNR, Pisa, 1997.

Technical Report TR-B4-16-97.PDFbibtex@techreport{Bientinesi1997:760, author = "Paolo Bientinesi and Marco Pellegrini", title = "Electrostatic Fields Without Singularities: Implementation and Experiments", institution = "Institute for Computational Mathematics, CNR, Pisa", year = 1997, note = "Technical Report TR-B4-16-97", url = "http://hpac.cs.umu.se/~pauldj/pubs/Electrostatic.pdf" }