# Recent Publications

### Submitted Papers

**Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions**Submitted to ACM TOMS, October 2020.abstractwebPDFTensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.**Linnea: Automatic Generation of Efficient Linear Algebra Programs**December 2019.

Submitted to ACM TOMS.abstractwebPDFThe translation of linear algebra computations into efficient sequences of library calls is a non-trivial task that requires expertise in both linear algebra and high-performance computing. Almost all high-level languages and libraries for matrix computations (e.g., Matlab, Eigen) internally use optimized kernels such as those provided by BLAS and LAPACK; however, their translation algorithms are often too simplistic and thus lead to a suboptimal use of said kernels, resulting in significant performance losses. In order to combine the productivity offered by high-level languages, and the performance of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem; as output, it returns an efficient sequence of calls to high-performance kernels. Linnea uses a custom best-first search algorithm to find a first solution in less than a second, and increasingly better solutions when given more time. In 125 test problems, the code generated by Linnea almost always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.**The Linear Algebra Mapping Problem**November 2019.

Submitted to SIAM Review.abstractwebPDFWe observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and the high-performance communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices, and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations such as common subexpression elimination and loop-invariant code motion, as well as linear algebra specific optimizations such as optimal parenthesization of a matrix product and kernel selection for matrices with properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.

### Journal Articles

**Accelerating AIREBO: Navigating the Journey from Complex Legacy Code to High Performance**Journal of Computational Chemistry, Volume 40(14), May 2019.@article{HÃ¶hnerbach2019:708, author = "Markus Höhnerbach and Paolo Bientinesi", title = "Accelerating AIREBO: Navigating the Journey from Complex Legacy Code to High Performance", journal = "Journal of Computational Chemistry", year = 2019, volume = 40, number = 14, month = may, url = "https://arxiv.org/pdf/1810.07026.pdf" }

abstractwebPDFbibtexDespite initiatives to improve the quality of scientific codes, there still is a large presence of legacy code. Such code often needs to implement a lot of functionality under time constrains, sacrificing quality. Additionally, quality is rarely improved by optimizations for new architectures. This development model leads to code that is increasingly difficult to work with. Our suggested solution includes complexity-reducing refactoring and hardware abstraction. We focus on the AIREBO potential from LAMMPS, where the challenge is that any potential kernel is rather large and complex, hindering systematic optimization. This issue is common to codes that model multiple physical phenomena. We present our journey from the C++ port of a previous Fortran code to performance-portable, KNC-hybrid, vectorized, scalable, optimized code supporting full and reduced precision. The journey includes extensive testing that fixed bugs in the original code. Large-scale, full-precision runs sustain speedups of more than 4x (KNL) and 3x (Skylake).**Assessment of sound spatialisation algorithms for sonic rendering with headsets**Journal of New Music Research, Volume 48(2), pp. 107-124, March 2019.@article{Tarzan2019:528, author = "Ali Tarzan and Paolo Bientinesi and Marco Alunno", title = "Assessment of sound spatialisation algorithms for sonic rendering with headsets", journal = "Journal of New Music Research ", year = 2019, volume = 48, number = 2, pages = "107-124", month = mar, url = "https://arxiv.org/abs/1711.09234" }

abstractPDFbibtexGiven an input sound signal and a target virtual sound source, sound spatialisation algorithms manipulate the signal so that a listener perceives it as though it were emitted from the target source. There exist several established spatialisation approaches that deliver satisfactory results when loudspeakers are used to playback the manipulated signal. As headphones have a number of desirable characteristics over loudspeakers, such as portability, isolation from the surrounding environment, cost and ease of use, it is interesting to explore how a sense of acoustic space can be conveyed through them. This article first surveys traditional spatialisation approaches intended for loudspeakers, and then reviews them with regard to their adaptability to headphones.**Spin Summations: A High-Performance Perspective**ACM Transactions on Mathematical Software (TOMS), Volume 45(1), March 2019.@article{Springer2019:60, author = "Paul Springer and Devin Matthews and Paolo Bientinesi", title = "Spin Summations: A High-Performance Perspective", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2019, volume = 45, number = 1, month = mar, url = "https://arxiv.org/pdf/1705.06661.pdf" }

abstractwebPDFbibtexBesides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms---including CCSD(T) and CCSDT(Q)---are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4x and 5.5x over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2x over an out-of-place variant.**MatchPy: Pattern Matching in Python**Journal of Open Source Software, Volume 3(26), pp. 2, June 2018.

### Peer Reviewed Conference Publications

**A Timer-Augmented Cost Function for Load Balanced DSMC**13th International Meeting on High Performance Computing for Computational Science (VECPAR 18), Lecture Notes in Computer Science, Volume 11333, September 2019.@inproceedings{McDoniel2019:488, author = "William McDoniel and Paolo Bientinesi", title = "A Timer-Augmented Cost Function for Load Balanced DSMC", booktitle = "13th International Meeting on High Performance Computing for Computational Science (VECPAR 18)", year = 2019, volume = 11333, series = "Lecture Notes in Computer Science", month = sep, url = "https://arxiv.org/pdf/1902.06040.pdf" }

abstractPDFbibtexDue to a hard dependency between time steps, large-scale simulations of gas using the Direct Simulation Monte Carlo (DSMC) method proceed at the pace of the slowest processor. Scalability is therefore achievable only by ensuring that the work done each time step is as evenly apportioned among the processors as possible. Furthermore, as the simulated system evolves, the load shifts, and thus this load-balancing typically needs to be performed multiple times over the course of a simulation. Common methods generally use either crude performance models or processor-level timers. We combine both to create a timer-augmented cost function which both converges quickly and yields well-balanced processor decompositions. When compared to a particle-based performance model alone, our method achieves 2x speedup at steady-state on up to 1024 processors for a test case consisting of a Mach 9 argon jet impacting a solid wall.**Automatic Generation of Efficient Linear Algebra Programs**To appear in the Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20), August 2019.@inproceedings{Barthels2019:200, author = "Henrik Barthels and Christos Psarras and Paolo Bientinesi", title = "Automatic Generation of Efficient Linear Algebra Programs", booktitle = "To appear in the Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20)", year = 2019, month = aug, journal = "CoRR" }

abstractwebbibtexThe level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter expresses the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a task, far from trivial, that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages, and make use of the efficiency of low level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.**Code Generation in Linnea**Proceedings of the 6th International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY 2019), 22 June 2019.

Extended abstract.@inproceedings{Barthels2019:800, author = "Henrik Barthels and Paolo Bientinesi", title = "Code Generation in Linnea", year = 2019, address = "Phoenix, Arizona", month = jun, note = "Extended abstract", url = "http://hpac.cs.umu.se/~barthels/publications/ARRAY_2019.pdf" }

abstractPDFbibtexLinnea is a code generator for the translation of high-level linear algebra problems to efficient code. Unlike other languages and libraries for linear algebra, Linnea heavily relies on domain-specific knowledge to rewrite expressions and infer matrix properties. Here we focus on two aspects related to code generation and matrix properties: 1) The automatic generation of code consisting of explicit calls to BLAS and LAPACK kernels, and the corresponding challenge with specialized storage formats. 2) A general notion of banded matrices can be used to simplify the inference of many matrix properties. While it is crucial to make use of matrix properties to achieve high performance, inferring those properties is challenging. We show how matrix bandwidth can be used as a unifying language to reason about many common matrix properties.