Christos Psarras

Aachen Institute for Advanced Study in

Computational Engineering Science

Rogowski Building

Schinkelstr. 2

52062 Aachen

Germany

Office: | Schinkelstr. 2, room 430 |

Email: |

- High-Performance and Parallel Computing
- Tensors
- Machine Learning
- Computer Vision

**CP-CALS**: Software for computing the Canonical Polyadic Decomposition (CPD), also known as PARAllel FACtors (PARAFAC), using the Concurrent Alternating Least Squares Algorithm (CALS).

**Role**: Designer and main developer.**LAMP Benchmark**: A series of benchmarks/tests across multiple high-level linear algebra oriented languages/libraries/frameworks that assess to what extent said languages translate high-level linear algebra expressions to high performance code.

**Role**: Designer and main developer.**Linnea**: Linnea is an experimental tool for the automatic generation of optimized code for linear algebra problems. It is developed at the High-Performance and Automatic Computing group at RWTH Aachen University. An online demo of Linnea can be found here.

**Role**: Contributor.**CodeSummarizer**: Summarizing Software Functionality from Source Code. This project was part of my MS thesis.

**Role**: Designer and main developer.

**Algorithm 1026: Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions**ACM Transactions on Mathematical Software, Volume 48(3), pp. 1-20, September 2022.@article{Psarras2022:980, author = "Christos Psarras and Lars Karlsson and Rasmus Bro and Paolo Bientinesi", title = "Algorithm 1026: Concurrent Alternating Least Squares for multiple simultaneous Canonical Polyadic Decompositions", journal = "ACM Transactions on Mathematical Software", year = 2022, volume = 48, number = 3, pages = "1--20", month = sep, doi = "10.1145/3519383" }

abstractwebbibtexTensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.**The Linear Algebra Mapping Problem. Current state of linear algebra languages and libraries.**ACM Transactions on Mathematical Software, Volume 48(3), pp. 1-30, September 2022.@article{Psarras2022:618, author = "Christos Psarras and Henrik Barthels and Paolo Bientinesi", title = "The Linear Algebra Mapping Problem. Current state of linear algebra languages and libraries.", journal = "ACM Transactions on Mathematical Software", year = 2022, volume = 48, number = 3, pages = "1--30", month = sep, doi = "10.1145/3549935", url = "https://arxiv.org/pdf/1911.09421v2.pdf" }

abstractwebPDFbibtexWe observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and the high-performance communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices, and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations such as common subexpression elimination and loop-invariant code motion, as well as linear algebra specific optimizations such as optimal parenthesization of a matrix product and kernel selection for matrices with properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.**Accelerating jackknife resampling for the Canonical Polyadic Decomposition**Frontiers in Applied Mathematics and Statistics, Volume 8, April 2022.@article{Psarras2022:328, author = "Christos Psarras and Lars Karlsson and Rasmus Bro and Paolo Bientinesi", title = "Accelerating jackknife resampling for the Canonical Polyadic Decomposition", journal = "Frontiers in Applied Mathematics and Statistics", year = 2022, volume = 8, month = apr, doi = "10.3389/fams.2022.830270", url = "https://www.frontiersin.org/articles/10.3389/fams.2022.830270/pdf" }

abstractwebPDFbibtexThe Canonical Polyadic (CP) tensor decomposition is frequently used as a model in applications in a variety of different fields. Using jackknife resampling to estimate parameter uncertainties is often desirable but results in an increase of the already high computational cost. Upon observation that the resampled tensors, though different, are nearly identical, we show that it is possible to extend the recently proposed Concurrent ALS (CALS) technique to a jackknife resampling scenario. This extension gives access to the computational efficiency advantage of CALS for the price of a modest increase (typically a few percent) in the number of floating point operations. Numerical experiments on both synthetic and real-world datasets demonstrate that the new workflow based on a CALS extension can be several times faster than a straightforward workflow where the jackknife submodels are processed individually.**Linnea: Automatic Generation of Efficient Linear Algebra Programs**ACM Transactions on Mathematical Software (TOMS), Volume 47(3), pp. 1-26, June 2021.@article{Barthels2021:688, author = "Henrik Barthels and Christos Psarras and Paolo Bientinesi", title = "Linnea: Automatic Generation of Efficient Linear Algebra Programs", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2021, volume = 47, number = 3, pages = "1--26", month = jun, doi = "10.1145/3446632", url = "https://arxiv.org/pdf/1912.12924.pdf" }

abstractwebPDFbibtexThe translation of linear algebra computations into efficient sequences of library calls is a non-trivial task that requires expertise in both linear algebra and high-performance computing. Almost all high-level languages and libraries for matrix computations (e.g., Matlab, Eigen) internally use optimized kernels such as those provided by BLAS and LAPACK; however, their translation algorithms are often too simplistic and thus lead to a suboptimal use of said kernels, resulting in significant performance losses. In order to combine the productivity offered by high-level languages, and the performance of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem; as output, it returns an efficient sequence of calls to high-performance kernels. Linnea uses a custom best-first search algorithm to find a first solution in less than a second, and increasingly better solutions when given more time. In 125 test problems, the code generated by Linnea almost always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.

**Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch**Proceedings of the iWAPT, March 2022.@inproceedings{_Sankaran2022:190, author = "Aravind Sankaran and {Navid Akbari} Alashti and Christos Psarras and Paolo Bientinesi", title = "Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch", year = 2022, month = mar, url = "https://arxiv.org/pdf/2202.09888.pdf" }

abstractwebPDFbibtexLinear algebra operations, which are ubiquitous in machine learning, form major performance bottlenecks. The High-Performance Computing community invests significant effort in the development of architecture-specific optimized kernels, such as those provided by the BLAS and LAPACK libraries, to speed up linear algebra operations. However, end users are progressively less likely to go through the error prone and time-consuming process of directly using said kernels; instead, frameworks such as TensorFlow (TF) and PyTorch (PyT), which facilitate the development of machine learning applications, are becoming more and more popular. Although such frameworks link to BLAS and LAPACK, it is not clear whether or not they make use of linear algebra knowledge to speed up computations. For this reason, in this paper we develop benchmarks to investigate the linear algebra optimization capabilities of TF and PyT. Our analyses reveal that a number of linear algebra optimizations are still missing; for instance, reducing the number of scalar operations by applying the distributive law, and automatically identifying the optimal parenthesization of a matrix chain. In this work, we focus on linear algebra computations in TF and PyT; we both expose opportunities for performance enhancement to the benefit of the developers of the frameworks and provide end users with guidelines on how to achieve performance gains.**Automatic Generation of Efficient Linear Algebra Programs**Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20), June 2020.@inproceedings{Barthels2020:200, author = "Henrik Barthels and Christos Psarras and Paolo Bientinesi", title = "Automatic Generation of Efficient Linear Algebra Programs", booktitle = "Proceedings of the Platform for Advanced Scientific Computing Conference (PASC20)", year = 2020, month = jun, journal = "CoRR" }

abstractwebbibtexThe level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter expresses the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a task, far from trivial, that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages, and make use of the efficiency of low level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x.**A Mechanism for Automatically Summarizing Software Functionality from Source Code**2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), pp. 121-130, 2019.@inproceedings{Psarras2019:818, author = "Christos Psarras and Themistoklis Diamantopoulos and Andreas Symeonidis", title = "A Mechanism for Automatically Summarizing Software Functionality from Source Code", booktitle = "2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS)", year = 2019, pages = "121--130", doi = "10.1109/QRS.2019.00028" }

abstractwebbibtexWhen developers search online to find software components to reuse, they usually first need to understand the container projects/libraries, and subsequently identify the required functionality. Several approaches identify and summarize the offerings of projects from their source code, however they often require that the developer has knowledge of the underlying topic modeling techniques; they do not provide a mechanism for tuning the number of topics, and they offer no control over the top terms for each topic. In this work, we use a vectorizer to extract information from variable/method names and comments, and apply Latent Dirichlet Allocation to cluster the source code files of a project into different semantic topics. The number of topics is optimized based on their purity with respect to project packages, while topic categories are constructed to provide further intuition and Stack Exchange tags are used to express the topics in more abstract terms.**Surface/subsurface mapping with an integrated rover-GPR system: A simulation approach**2018 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), pp. 15-22, 2018.@inproceedings{Kouros2018:660, author = "George Kouros and Christos Psarras and Ioannis Kostavelis and Dimitrios Giakoumis and Dimitrios Tzovaras", title = "Surface/subsurface mapping with an integrated rover-GPR system: A simulation approach", booktitle = "2018 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR)", year = 2018, pages = "15--22", doi = "10.1109/SIMPAR.2018.8376265" }

abstractwebbibtexAutonomous subsurface mapping is a key characteristic of future robots to be realized in the construction domain, since it can be utilized in diverse applications of strategic importance. During the last years, the interest has been steered mainly towards the development of ground-penetrating radar (GPR) devices, rather than on the establishment of holistic subsurface reconstruction methods. To this end, the paper at hand introduces a simulation tool that comprises a) a surface operating rover and b) a sonar-based simulated GPR array capable, seamlessly integrated to build adjunct surface and subsurface maps. Specifically, by exploiting the onboard stereo camera of the robot and the GPR, mounted on a robotic-trailer topology, joint surface and subsurface mapping is performed. Further processing of the simulated GPR data is applied to detect and semantically annotate georeferenced buried utilities, while the localization of surface rover is also employed for the topographic correction of the accumulated B-scans during the robot's exploration. The proposed framework has been developed in the ROS framework and has been evaluated on the realistic simulation environment of Gazebo.

**The Landscape of Software for Tensor Computations**June 2022.@techreport{Psarras2022:198, author = "Christos Psarras and Lars Karlsson and Jiajia Li and Paolo Bientinesi", title = "The Landscape of Software for Tensor Computations", year = 2022, month = jun, journal = "arXiv", url = "https://arxiv.org/pdf/2103.13756" }

abstractwebPDFbibtexTensors (also commonly seen as multi-linear operators or asmulti-dimensional arrays) are ubiquitous in scientific computing and in data science, and so are the software efforts for tensor operations. Particularly in recent years, we have observed an explosion in libraries, compilers, packages, and toolboxes; unfortunately these efforts are very much scattered among the different scientific domains, and inevitably suffer from replication, suboptimal implementations, and in many cases, limited visibility. As a first step towards countering these inefficiencies, here we survey and loosely classify software packages related to tensor computations. Our aim is to assemble a comprehensive and up-to-date snapshot of the tensor software landscape, with the intention of helping both users and developers.

**Performance-optimal resource allocation for matrix chain products on multithreaded architectures**

*Alexander Tyson Reeh, Master's Thesis.*

**Predictability of optimal core distribution based on weight and speedup**

*Rasmus Eriksson, Bachelor's Thesis.*

**The MTTKRP, a closer look**Dagstuhl Seminar 22101, Tensor Computations: Applications and Optimization.

March 2022.**Achieving the compute bound with CP-CALS**BLIS Retreat 2020.

5 October 2020.**Achieving the compute bound with Concurrect Alternating Least Squares for the Canonical Polyadic Decomposition (CP-CALS)**IRTG Annual Meeting.

July 2020.**Efficiently Mapping Linear Algebra to High-Performance Code (Poster session)**TACCSTER 2019.

September 2019.**The Linear Algebra Mapping Problem**BLIS Retreat, September 2019.**Model-based Generation of Linear Algebra Software**IRTG Annual Meeting.

July 2019.**Harnessing the power of version control - An introduction to Git**Sustainable Scientific Software Sessions with Snacks, February 2019.abstractIn the first installement of the "Sustainable Scientific Software Sessions with Snacks" (S5) I present a hands on workshop on Git. Git is a distributed version-control system for tracking changes in source code during software development. In the spirit of the S5, I demonstrate how it can be used to develop and manage scientific source code, both individually and as a team.