# Recent Publications

### Journal Articles

**Algorithm 1026: Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions**
ACM Transactions on Mathematical Software, Volume 48(3), pp. 1-20, September 2022.

```bibtex
@article{Psarras2022:980,
  author  = "Christos Psarras and Lars Karlsson and Rasmus Bro and Paolo Bientinesi",
  title   = "Algorithm 1026: Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions",
  journal = "ACM Transactions on Mathematical Software",
  year    = 2022,
  volume  = 48,
  number  = 3,
  pages   = "1--20",
  month   = sep,
  doi     = "10.1145/3519383"
}
```

Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. As a result, it becomes possible to make efficient use of GPUs for further speedups; at the same time, the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.

**The Linear Algebra Mapping Problem. Current state of linear algebra languages and libraries.**
ACM Transactions on Mathematical Software, Volume 48(3), pp. 1-30, September 2022.

```bibtex
@article{Psarras2022:618,
  author  = "Christos Psarras and Henrik Barthels and Paolo Bientinesi",
  title   = "The Linear Algebra Mapping Problem. Current state of linear algebra languages and libraries.",
  journal = "ACM Transactions on Mathematical Software",
  year    = 2022,
  volume  = 48,
  number  = 3,
  pages   = "1--30",
  month   = sep,
  doi     = "10.1145/3549935",
  url     = "https://arxiv.org/pdf/1911.09421v2.pdf"
}
```

We observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and the high-performance communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices, and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations, such as common subexpression elimination and loop-invariant code motion, and linear algebra-specific optimizations, such as the optimal parenthesization of a matrix product and kernel selection for matrices with properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.

**Accelerating jackknife resampling for the Canonical Polyadic Decomposition**
Frontiers in Applied Mathematics and Statistics, Volume 8, April 2022.

```bibtex
@article{Psarras2022:328,
  author  = "Christos Psarras and Lars Karlsson and Rasmus Bro and Paolo Bientinesi",
  title   = "Accelerating jackknife resampling for the Canonical Polyadic Decomposition",
  journal = "Frontiers in Applied Mathematics and Statistics",
  year    = 2022,
  volume  = 8,
  month   = apr,
  doi     = "10.3389/fams.2022.830270",
  url     = "https://www.frontiersin.org/articles/10.3389/fams.2022.830270/pdf"
}
```

The Canonical Polyadic (CP) tensor decomposition is frequently used as a model in applications in a variety of different fields. Using jackknife resampling to estimate parameter uncertainties is often desirable, but results in an increase of the already high computational cost. Upon observing that the resampled tensors, though different, are nearly identical, we show that it is possible to extend the recently proposed Concurrent ALS (CALS) technique to a jackknife resampling scenario. This extension gives access to the computational efficiency advantage of CALS for the price of a modest increase (typically a few percent) in the number of floating point operations. Numerical experiments on both synthetic and real-world datasets demonstrate that the new workflow based on a CALS extension can be several times faster than a straightforward workflow where the jackknife submodels are processed individually.

**Work-stealing prefix scan: Addressing load imbalance in large-scale image registration**
IEEE Transactions on Parallel and Distributed Systems, Volume 33(3), pp. 523-535, 2022.

```bibtex
@article{Copik2022:900,
  author  = "Marcin Copik and Tobias Grosser and Torsten Hoefler and Paolo Bientinesi and Benjamin Berkels",
  title   = "Work-stealing prefix scan: Addressing load imbalance in large-scale image registration",
  journal = "IEEE Transactions on Parallel and Distributed Systems",
  year    = 2022,
  volume  = 33,
  number  = 3,
  pages   = "523-535",
  url     = "https://arxiv.org/pdf/2010.12478.pdf"
}
```

Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this paper, we study the recursive registration of a series of electron microscopy images – a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at the nanoscale for long microscopy image series.
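The key idea behind the prefix-scan reformulation above can be sketched in a few lines: once a sequential recurrence is phrased in terms of an associative operator, prefixes can be computed block by block, and the blocks processed in parallel. The sketch below is a generic illustration under our own naming (`inclusive_scan`, `blocked_scan`, and the affine-map operator are illustrative assumptions, not the paper's actual work-stealing implementation):

```python
def inclusive_scan(xs, op):
    """Sequential inclusive prefix scan: out[i] = xs[0] op ... op xs[i]."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def blocked_scan(xs, op, nblocks=4):
    """Two-pass blocked scan: scan each block, scan the per-block totals,
    then combine. Each pass over the blocks could run in parallel."""
    size = -(-len(xs) // nblocks)  # ceiling division
    blocks = [xs[i:i + size] for i in range(0, len(xs), size)]
    local = [inclusive_scan(b, op) for b in blocks]       # pass 1: local scans
    offsets = inclusive_scan([b[-1] for b in local], op)  # scan of block totals
    out = list(local[0])
    for off, b in zip(offsets, local[1:]):                # pass 2: apply offsets
        out.extend(op(off, x) for x in b)
    return out

def compose(f, g):
    """Compose 1-D affine maps (a, b) ~ x -> a*x + b: apply f first, then g."""
    return (g[0] * f[0], g[0] * f[1] + g[1])
```

Composing affine maps stands in, very loosely, for composing per-image alignment transforms; the paper's actual contribution lies in handling the severe load imbalance of the registration operator, which this sketch does not model.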

### Peer Reviewed Conference Publications

**FLOPs as a Discriminant for Dense Linear Algebra Algorithms**
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022), August 2022.

```bibtex
@inproceedings{Lopez2022:530,
  author    = "Francisco Lopez and Lars Karlsson and Paolo Bientinesi",
  title     = "FLOPs as a Discriminant for Dense Linear Algebra Algorithms",
  booktitle = "Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022)",
  year      = 2022,
  month     = aug,
  url       = "https://arxiv.org/pdf/2207.02070.pdf"
}
```

Expressions that involve matrices and vectors, known as linear algebra expressions, are commonly evaluated through a sequence of invocations to highly optimised kernels provided in libraries such as BLAS and LAPACK. A sequence of kernels represents an algorithm, and in general, because of associativity, algebraic identities, and multiple kernels, one expression can be evaluated via many different algorithms. These algorithms are all mathematically equivalent (i.e., in exact arithmetic, they all compute the same result), but often differ noticeably in terms of execution time. When faced with a decision, high-level languages, libraries, and tools such as Julia, Armadillo, and Linnea choose by selecting the algorithm that minimises the FLOP count. In this paper, we test the validity of the FLOP count as a discriminant for dense linear algebra algorithms, analysing "anomalies": problem instances for which the fastest algorithm does not perform the fewest FLOPs. To do so, we focused on relatively simple expressions and analysed when and why anomalies occurred. We found that anomalies exist and tend to cluster into large contiguous regions. For one expression anomalies were rare, whereas for the other they were abundant. We conclude that the FLOP count is not a sufficiently dependable discriminant, even when building algorithms with highly optimised kernels. Moreover, most of the anomalies remained as such even after filtering out the inter-kernel cache effects. We conjecture that combining FLOP counts with kernel performance models will significantly improve our ability to choose optimal algorithms.

**MOM: Matrix Operations in MLIR**
Proceedings of the 12th International Workshop on Polyhedral Compilation Techniques, May 2022.

```bibtex
@inproceedings{Chiellini2022:920,
  author  = "Lorenzo Chiellini and Henrik Barthels and Marcin Copik and Tobias Grosser and Paolo Bientinesi and Daniele Spampinato",
  title   = "MOM: Matrix Operations in MLIR",
  year    = 2022,
  address = "Budapest, Hungary",
  month   = may
}
```

Modern research in code generators for dense linear algebra computations has shown the ability to produce optimized code with performance that matches, and often exceeds, that of state-of-the-art implementations by domain experts. However, the underlying infrastructure is often developed in isolation, making the interconnection of logically combinable systems complicated, if not impossible. In this paper, we propose to leverage MLIR as a unifying compiler infrastructure for the optimization of dense linear algebra operations. We propose a new MLIR dialect for expressing linear algebraic computations, including matrix properties, to enable high-level algorithmic transformations. The integration of this new dialect in MLIR enables end-to-end compilation of matrix computations via conversion to existing lower-level dialects already provided by the framework.

**Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch**
Proceedings of iWAPT, March 2022.

```bibtex
@inproceedings{_Sankaran2022:190,
  author = "Aravind Sankaran and {Navid Akbari} Alashti and Christos Psarras and Paolo Bientinesi",
  title  = "Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch",
  year   = 2022,
  month  = mar,
  url    = "https://arxiv.org/pdf/2202.09888.pdf"
}
```

Linear algebra operations, which are ubiquitous in machine learning, form major performance bottlenecks. The High-Performance Computing community invests significant effort in the development of architecture-specific optimized kernels, such as those provided by the BLAS and LAPACK libraries, to speed up linear algebra operations. However, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said kernels; instead, frameworks such as TensorFlow (TF) and PyTorch (PyT), which facilitate the development of machine learning applications, are becoming more and more popular. Although such frameworks link to BLAS and LAPACK, it is not clear whether or not they make use of linear algebra knowledge to speed up computations. For this reason, in this paper we develop benchmarks to investigate the linear algebra optimization capabilities of TF and PyT. Our analyses reveal that a number of linear algebra optimizations are still missing; for instance, reducing the number of scalar operations by applying the distributive law, and automatically identifying the optimal parenthesization of a matrix chain. In this work, we focus on linear algebra computations in TF and PyT; we both expose opportunities for performance enhancement to the benefit of the developers of the frameworks, and provide end users with guidelines on how to achieve performance gains.

**A Test for FLOPs as a Discriminant for Linear Algebra Algorithms**
Proceedings of SBAC-PAD 2022, IEEE CPS, 2022.

```bibtex
@inproceedings{_Sankaran2022:404,
  author    = "Aravind Sankaran and Paolo Bientinesi",
  title     = "A Test for FLOPs as a Discriminant for Linear Algebra Algorithms",
  booktitle = "Proceedings of SBAC-PAD 2022",
  year      = 2022,
  publisher = "IEEE CPS"
}
```

Linear algebra expressions, which play a central role in countless scientific computations, are often computed via a sequence of calls to existing libraries of building blocks (such as those provided by BLAS and LAPACK). A sequence identifies a computing strategy, i.e., an algorithm, and normally, for one linear algebra expression, many alternative algorithms exist. Although mathematically equivalent, those algorithms might exhibit significant differences in terms of performance. Several high-level languages and tools for matrix computations, such as Julia, Armadillo, and Linnea, make algorithmic choices by minimizing the number of Floating Point Operations (FLOPs). However, there can be several algorithms that share the same (or have nearly identical) numbers of FLOPs; in many cases, these algorithms exhibit execution times which are statistically equivalent, and one could arbitrarily select one of them as a potential best candidate, but it is not unlikely to find cases where the execution times are significantly different (despite the FLOP count being almost the same). It is also possible that the algorithm that minimizes FLOPs is not the one that minimizes execution time. In this work, we develop a methodology to test the reliability of FLOPs as a discriminant for a given set of linear algebra algorithms. As a first step, all the algorithms are executed once and potential candidates are shortlisted; to this end, all the algorithms with minimum FLOPs are selected, along with those algorithms that perform more FLOPs but have single-run execution times lower than, or close enough to, the algorithms with minimum FLOPs. From this, an initial hypothesis is formed by ranking the potential candidates based on their single-run execution times. Further measurements of the potential candidates are taken iteratively, and the ranks are updated or merged based on statistical evidence. The iteration of measurements stops automatically as the changes in ranks converge. Potential candidates with comparable performance profiles obtain the same rank. If FLOPs are a valid discriminant for a given set of algorithms, then all the algorithms with minimum FLOPs end up having the best rank. If not, our methodology classifies the algorithms into performance classes, which can then be used in the investigation of the root cause of performance differences, or serve as a ground truth for machine-learning based performance models.
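A recurring theme in the entries above (the Linear Algebra Mapping Problem, the TensorFlow/PyTorch benchmarks, and both FLOPs papers) is that mathematically equivalent evaluation orders can differ enormously in cost, while FLOP counts only partially predict execution time. A minimal NumPy sketch of the parenthesization effect; the matrix sizes and the `matmul_flops` helper are illustrative assumptions, not taken from any of the papers:

```python
import numpy as np

def matmul_flops(m, k, n):
    """FLOPs of an (m x k) @ (k x n) product: one multiply and one add
    per output entry per step of the inner dimension."""
    return 2 * m * k * n

m, k = 500, 500
A = np.random.rand(m, k)
B = np.random.rand(k, k)
v = np.random.rand(k, 1)

# Two mathematically equivalent algorithms for the chain A @ B @ v:
left  = matmul_flops(m, k, k) + matmul_flops(m, k, 1)  # (A @ B) @ v
right = matmul_flops(k, k, 1) + matmul_flops(m, k, 1)  # A @ (B @ v)

# The right-to-left order performs orders of magnitude fewer FLOPs ...
assert right < left

# ... and both orders agree up to round-off.
assert np.allclose((A @ B) @ v, A @ (B @ v))
```

Whether the fewest-FLOPs algorithm is also the fastest is exactly the question the two FLOPs papers examine; for chains with a large size imbalance, as here, it usually is.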

### Thesis

**Linnea: A Compiler for Mapping Linear Algebra Problems onto High-Performance Kernel Libraries**
Dissertation, RWTH Aachen University, August 2021.

```bibtex
@phdthesis{Barthels2021:754,
  author = "Henrik Barthels",
  title  = "Linnea: A Compiler for Mapping Linear Algebra Problems onto High-Performance Kernel Libraries",
  school = "RWTH Aachen University",
  year   = 2021,
  type   = "Dissertation",
  month  = aug,
  url    = "https://hpac.cs.umu.se/~barthels/publications/dissertation.pdf"
}
```

### Technical Report

**The Landscape of Software for Tensor Computations**
arXiv, June 2022.

```bibtex
@techreport{Psarras2022:198,
  author  = "Christos Psarras and Lars Karlsson and Jiajia Li and Paolo Bientinesi",
  title   = "The Landscape of Software for Tensor Computations",
  year    = 2022,
  month   = jun,
  journal = "arXiv",
  url     = "https://arxiv.org/pdf/2103.13756"
}
```

Tensors (also commonly seen as multi-linear operators or as multi-dimensional arrays) are ubiquitous in scientific computing and in data science, and so are the software efforts for tensor operations. Particularly in recent years, we have observed an explosion in libraries, compilers, packages, and toolboxes; unfortunately, these efforts are very much scattered among the different scientific domains, and inevitably suffer from replication, suboptimal implementations, and in many cases, limited visibility. As a first step towards countering these inefficiencies, here we survey and loosely classify software packages related to tensor computations. Our aim is to assemble a comprehensive and up-to-date snapshot of the tensor software landscape, with the intention of helping both users and developers.