# Recent Publications

### Submitted Papers

**Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions**
Submitted to ACM TOMS, October 2020.

Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. As a consequence, it becomes possible to make efficient use of GPUs for further speedups; at the same time, the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.

**The Linear Algebra Mapping Problem. Current state of linear algebra languages and libraries.**
November 2019.

We observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and high-performance computing communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower-level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that, in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations, such as common subexpression elimination and loop-invariant code motion, and linear algebra specific optimizations, such as optimal parenthesization of a matrix product and kernel selection for matrices with properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.
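To make the parenthesization optimization mentioned above concrete, here is a small NumPy sketch (our illustration, not code from the paper): for the product A B c, with A and B square matrices and c a vector, evaluating left to right materializes the matrix-matrix product A B, while the right-to-left order only ever performs matrix-vector products.

```python
import numpy as np

def matmul_cost(m, k, n):
    # Multiplying an (m x k) matrix by a (k x n) matrix costs about 2*m*k*n flops.
    return 2 * m * k * n

n = 1000
# A, B: n x n matrices; c: n x 1 vector.
left_to_right = matmul_cost(n, n, n) + matmul_cost(n, n, 1)  # (A @ B) @ c
right_to_left = matmul_cost(n, n, 1) + matmul_cost(n, n, 1)  # A @ (B @ c)
print(left_to_right / right_to_left)  # 500.5: the naive order does ~500x more work

# Both orders compute the same vector, so a compiler is free to pick the cheap one:
rng = np.random.default_rng(0)
A, B, c = rng.random((n, n)), rng.random((n, n)), rng.random((n, 1))
assert np.allclose((A @ B) @ c, A @ (B @ c))
```

A language that evaluates `A @ B @ c` strictly left to right pays the full matrix-matrix cost; detecting such cases automatically is one of the things the benchmark probes.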

### Journal Articles

**Work-stealing prefix scan: Addressing load imbalance in large-scale image registration**
IEEE Transactions on Parallel and Distributed Systems, Volume 33(3), pp. 523-535, 2022.

```bibtex
@article{Copik2022:900,
  author  = "Marcin Copik and Tobias Grosser and Torsten Hoefler and Paolo Bientinesi and Benjamin Berkels",
  title   = "Work-stealing prefix scan: Addressing load imbalance in large-scale image registration",
  journal = "IEEE Transactions on Parallel and Distributed Systems",
  year    = 2022,
  volume  = 33,
  number  = 3,
  pages   = "523-535",
  url     = "https://arxiv.org/pdf/2010.12478.pdf"
}
```

Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this paper, we study the recursive registration of a series of electron microscopy images, a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce the time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling the derivation of material properties at the nanoscale for long microscopy image series.

**Linnea: Automatic Generation of Efficient Linear Algebra Programs**
ACM Transactions on Mathematical Software (TOMS), Volume 47(3), pp. 1-26, June 2021.

```bibtex
@article{Barthels2021:688,
  author  = "Henrik Barthels and Christos Psarras and Paolo Bientinesi",
  title   = "Linnea: Automatic Generation of Efficient Linear Algebra Programs",
  journal = "ACM Transactions on Mathematical Software (TOMS)",
  year    = 2021,
  volume  = 47,
  number  = 3,
  pages   = "1--26",
  month   = jun,
  url     = "https://arxiv.org/pdf/1912.12924.pdf"
}
```

The translation of linear algebra computations into efficient sequences of library calls is a non-trivial task that requires expertise in both linear algebra and high-performance computing. Almost all high-level languages and libraries for matrix computations (e.g., Matlab, Eigen) internally use optimized kernels such as those provided by BLAS and LAPACK; however, their translation algorithms are often too simplistic and thus lead to a suboptimal use of said kernels, resulting in significant performance losses. In order to combine the productivity offered by high-level languages and the performance of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem; as output, it returns an efficient sequence of calls to high-performance kernels. Linnea uses a custom best-first search algorithm to find a first solution in less than a second, and increasingly better solutions when given more time. In 125 test problems, the code generated by Linnea almost always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.

**Rational spectral filters with optimal convergence rate**
SIAM Journal of Scientific Computing, Volume 43(4), pp. A2660-A2684, 2021.

```bibtex
@article{Kollnig2021:978,
  author  = "Konrad Kollnig and Paolo Bientinesi and Edoardo {Di Napoli}",
  title   = "Rational spectral filters with optimal convergence rate",
  journal = "SIAM Journal of Scientific Computing",
  year    = 2021,
  volume  = 43,
  number  = 4,
  pages   = "A2660-A2684",
  url     = "https://arxiv.org/pdf/2001.04184.pdf"
}
```

In recent years, contour-based eigensolvers have emerged as a standard approach for the solution of large and sparse eigenvalue problems. Building upon recent performance improvements through non-linear least-squares optimization of so-called rational filters, we introduce a systematic method to design these filters by minimizing the worst-case convergence ratio and eliminating the parametric dependence on weight functions. Further, we provide an efficient way to deal with the box constraints which play a central role for the use of iterative linear solvers in contour-based eigensolvers. Indeed, these parameter-free filters consistently minimize the number of iterations and the number of FLOPs needed to reach convergence in the eigensolver. As a byproduct, our rational filters allow for a simple solution to load balancing when the solution of an interior eigenproblem is approached by slicing the sought-after spectral interval.

### Peer Reviewed Conference Publications

**Performance Comparison for Scientific Computations on the Edge via Relative Performance**
Proceedings of the 3rd Workshop on Parallel AI and Systems for the Edge (PAISE 2021), March 2021.

```bibtex
@inproceedings{_Sankaran2021:48,
  author = "Aravind Sankaran and Paolo Bientinesi",
  title  = "Performance Comparison for Scientific Computations on the Edge via Relative Performance",
  year   = 2021,
  month  = mar,
  url    = "https://arxiv.org/pdf/2102.12740.pdf"
}
```

In a typical Internet-of-Things setting that involves scientific applications, a target computation can be evaluated in many different ways, depending on the split of computations among various devices. On the one hand, different implementations (or algorithms), equivalent from a mathematical perspective, might exhibit significant differences in performance. On the other hand, some of the implementations are likely to show similar performance characteristics. In this paper, we focus on analysing the performance of a given set of algorithms by clustering them into performance classes. To this end, we use a measurement-based approach to evaluate and score algorithms based on pair-wise comparisons; we refer to this approach as "relative performance analysis". Each comparison yields one of three outcomes: one algorithm can be "better", "worse", or "equivalent" to another; those algorithms evaluating to "equivalent" performance are merged into the same performance class. We show that our clustering methodology facilitates algorithm selection with respect to more than one metric; for instance, from the subset of equivalently fast algorithms, one could then select an algorithm that consumes the least energy on a certain device.

**Automatic Generation of Efficient Linear Algebra Programs**
Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20), June 2020.

```bibtex
@inproceedings{Barthels2020:200,
  author    = "Henrik Barthels and Christos Psarras and Paolo Bientinesi",
  title     = "Automatic Generation of Efficient Linear Algebra Programs",
  booktitle = "Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20)",
  year      = 2020,
  month     = jun,
  journal   = "CoRR"
}
```

The level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter expresses the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a non-trivial task that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages and make use of the efficiency of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.

**A Timer-Augmented Cost Function for Load Balanced DSMC**
13th International Meeting on High Performance Computing for Computational Science (VECPAR 18), Lecture Notes in Computer Science, Volume 11333, September 2019.

```bibtex
@inproceedings{McDoniel2019:488,
  author    = "William McDoniel and Paolo Bientinesi",
  title     = "A Timer-Augmented Cost Function for Load Balanced DSMC",
  booktitle = "13th International Meeting on High Performance Computing for Computational Science (VECPAR 18)",
  year      = 2019,
  volume    = 11333,
  series    = "Lecture Notes in Computer Science",
  month     = sep,
  url       = "https://arxiv.org/pdf/1902.06040.pdf"
}
```

Due to a hard dependency between time steps, large-scale simulations of gas using the Direct Simulation Monte Carlo (DSMC) method proceed at the pace of the slowest processor. Scalability is therefore achievable only by ensuring that the work done in each time step is apportioned among the processors as evenly as possible. Furthermore, as the simulated system evolves, the load shifts, and thus this load balancing typically needs to be performed multiple times over the course of a simulation. Common methods generally use either crude performance models or processor-level timers. We combine both to create a timer-augmented cost function which both converges quickly and yields well-balanced processor decompositions. When compared to a particle-based performance model alone, our method achieves a 2x speedup at steady state on up to 1024 processors for a test case consisting of a Mach 9 argon jet impacting a solid wall.

**Code Generation in Linnea**
Proceedings of the 6th International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2019), 22 June 2019.

Extended abstract.

```bibtex
@inproceedings{Barthels2019:800,
  author  = "Henrik Barthels and Paolo Bientinesi",
  title   = "Code Generation in Linnea",
  year    = 2019,
  address = "Phoenix, Arizona",
  month   = jun,
  note    = "Extended abstract",
  url     = "http://hpac.cs.umu.se/~barthels/publications/ARRAY_2019.pdf"
}
```

Linnea is a code generator for the translation of high-level linear algebra problems to efficient code. Unlike other languages and libraries for linear algebra, Linnea heavily relies on domain-specific knowledge to rewrite expressions and infer matrix properties. Here we focus on two aspects related to code generation and matrix properties: 1) the automatic generation of code consisting of explicit calls to BLAS and LAPACK kernels, and the corresponding challenge posed by specialized storage formats; 2) a general notion of banded matrices that can be used to simplify the inference of many matrix properties. While it is crucial to make use of matrix properties to achieve high performance, inferring those properties is challenging. We show how matrix bandwidth can be used as a unifying language to reason about many common matrix properties.
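The bandwidth idea can be made concrete with a small NumPy sketch (a generic illustration, not Linnea's actual inference code): describing a matrix by its lower and upper bandwidths subsumes properties like diagonal, triangular, and bidiagonal, and properties of products follow from simple arithmetic on bandwidths.

```python
import numpy as np

def bandwidths(A):
    # (lower, upper): how far nonzeros extend below and above the main diagonal.
    rows, cols = np.nonzero(A)
    if rows.size == 0:
        return (0, 0)
    return (max(int((rows - cols).max()), 0), max(int((cols - rows).max()), 0))

n = 6
rng = np.random.default_rng(1)
L_a, L_b = np.tril(rng.random((n, n))), np.tril(rng.random((n, n)))

print(bandwidths(np.eye(n)))  # diagonal:         (0, 0)
print(bandwidths(L_a))        # lower triangular: (5, 0)
print(bandwidths(np.eye(n) + np.diag(np.ones(n - 1), 1)))  # upper bidiagonal: (0, 1)

# The bandwidths of a product are bounded by the sums of the factors' bandwidths,
# so "lower triangular times lower triangular is lower triangular" comes for free:
print(bandwidths(L_a @ L_b))  # (5, 0)
```

Reasoning about one pair of integers per matrix, instead of a separate rule per property, is what makes bandwidth attractive as a unifying language for property inference.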

### Technical Report

**The Landscape of Software for Tensor Computations**
May 2021.

```bibtex
@techreport{Psarras2021:198,
  author  = "Christos Psarras and Lars Karlsson and Jiajia Li and Paolo Bientinesi",
  title   = "The Landscape of Software for Tensor Computations",
  year    = 2021,
  month   = may,
  journal = "arXiv",
  url     = "https://arxiv.org/pdf/2103.13756.pdf"
}
```

Tensors (also commonly seen as multi-linear operators or as multi-dimensional arrays) are ubiquitous in scientific computing and in data science, and so are the software efforts for tensor operations. Particularly in recent years, we have observed an explosion in libraries, compilers, packages, and toolboxes; unfortunately, these efforts are very much scattered among the different scientific domains, and inevitably suffer from replication, suboptimal implementations, and, in many cases, limited visibility. As a first step towards countering these inefficiencies, here we survey and loosely classify software packages related to tensor computations. Our aim is to assemble a comprehensive and up-to-date snapshot of the tensor software landscape, with the intention of helping both users and developers.