Dr. Diego Fabregat Traver
Contact
Diego Fabregat Traver
Aachen Institute for Advanced Study in
Computational Engineering Science
Rogowski Building
Schinkelstr. 2
52062 Aachen
Germany
Office:  Schinkelstr. 2a (GRS), 2nd floor, room 207 
Phone:  +49 241 80 99 128 
Fax:  +49 241 80 628498 
Email: 
Education
09/2009  12/2013  Dr. rer. nat. in Computer Science, Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen University. 

09/2003  09/2008  Computer Sciences and Engineering (equiv. to M.S.) at Universitat Jaume I, Castellón, Spain. 
Professional Career
Since 12/2013  Postdoctoral Research Associate at AICES 

01/2009  07/2009 
Research assistant at Universitat Jaume I, Spain. Design and development of a computing library for coprocessors. 
Research Interests
 Automation
 Domain Specific Compilers
 Numerical Linear Algebra
 High Performance
Software
 OmicABEL: Highperformance solvers for mixedmodels GWAS. Distributed as part of the GenABEL suite for statistical genomics. Role: Designer and main developer.
 lwaio: Lightweight asynchronous input/output library. Role: Designer and developer.
Teaching
Publications
Journal Articles
 Accelerating scientific codes by performance and accuracy modeling Journal of Computational Science, April 2018.
Accepted.@article{FabregatTraver2018:408, author = "Diego FabregatTraver and {Ahmed E.} Ismail and Paolo Bientinesi", title = "Accelerating scientific codes by performance and accuracy modeling ", journal = "Journal of Computational Science", year = 2018, month = apr, note = "Accepted", url = "http://arxiv.org/abs/1608.04694" }
abstractPDFbibtexScientific software is often driven by multiple parameters that affect both accuracy and performance. Since finding the optimal configuration of these parameters is a highly complex task, it extremely common that the software is used suboptimally. In a typical scenario, accuracy requirements are imposed, and attained through suboptimal performance. In this paper, we present a methodology for the automatic selection of parameters for simulation codes, and a corresponding prototype tool. To be amenable to our methodology, the target code must expose the parameters affecting accuracy and performance, and there must be formulas available for error bounds and computational complexity of the underlying methods. As a case study, we consider the particleparticle particlemesh method (PPPM) from the LAMMPS suite for molecular dynamics, and use our tool to identify configurations of the input parameters that achieve a given accuracy in the shortest execution time. When compared with the configurations suggested by expert users, the parameters selected by our tool yield reductions in the timetosolution ranging between 10% and 60%. In other words, for the typical scenario where a fixed number of corehours are granted and simulations of a fixed number of timesteps are to be run, usage of our tool may allow up to twice as many simulations. While we develop our ideas using LAMMPS as computational framework and use the PPPM method for dispersion as case study, the methodology is general and valid for a range of software tools and methods.  LargeScale Linear Regression: Development of HighPerformance RoutinesApplied Mathematics and Computation, Volume 275, pp. 411421, 15 February 2016.
@article{Frank2016:90, author = "Alvaro Frank and Diego FabregatTraver and Paolo Bientinesi", title = "LargeScale Linear Regression: Development of HighPerformance Routines", journal = "Applied Mathematics and Computation", year = 2016, volume = 275, pages = "411421", month = feb, url = "http://arxiv.org/abs/1504.07890" }
abstractwebPDFbibtexIn statistics, series of ordinary least squares problems (OLS) are used to study the linear correlation among sets of variables of interest; in many studies, the number of such variables is at least in the millions, and the corresponding datasets occupy terabytes of disk space. As the availability of largescale datasets increases regularly, so does the challenge in dealing with them. Indeed, traditional solverswhich rely on the use of ``blackbox'' routines optimized for one single OLSare highly inefficient and fail to provide a viable solution for bigdata analyses. As a case study, in this paper we consider a linear regression consisting of twodimensional grids of related OLS problems that arise in the context of genomewide association analyses, and give a careful walkthrough for the development of {\sc olsgrid}, a highperformance routine for sharedmemory architectures; analogous steps are relevant for tailoring OLS solvers to other applications. In particular, we first illustrate the design of efficient algorithms that exploit the structure of the OLS problems and eliminate redundant computations; then, we show how to effectively deal with datasets that do not fit in main memory; finally, we discuss how to cast the computation in terms of efficient kernels and how to achieve scalability. Importantly, each design decision along the way is justified by simple performance models. {\sc olsgrid} enables the solution of $10^{11}$ correlated OLS problems operating on terabytes of data in a matter of hours.  High Performance Solutions for Bigdata GWASParallel Computing, Volume 42, pp. 75  87, February 2015.
Special issue on Parallelism in Bioinformatics.@article{Peise2015:754, author = "Elmar Peise and Diego FabregatTraver and Paolo Bientinesi", title = "High Performance Solutions for Bigdata GWAS", journal = "Parallel Computing", year = 2015, volume = 42, pages = "75  87", month = feb, note = "Special issue on Parallelism in Bioinformatics", url = "http://arxiv.org/pdf/1403.6426v1" }
abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genomewide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: The genotype and phenotype data is streamed from secondary storage using a double buffering technique, while the relationship matrix is kept across the main memory of a distributed memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain highperformance and allow the analysis of enormous datasets.  BigData, HighPerformance, Mixed Models Based GenomeWide Association AnalysisF1000Research, Volume 3(200), August 2014.
Open peer reviews.@article{FabregatTraver2014:430, author = "Diego FabregatTraver and Sodbo Sharapov and {Caroline } Hayward and {Igor } Rudan and {Harry } Campbell and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "BigData, HighPerformance, Mixed Models Based GenomeWide Association Analysis", journal = "F1000Research", year = 2014, volume = 3, number = 200, month = aug, note = "Open peer reviews", howpublished = "F1000Research" }
abstractwebbibtexTo raise the power of genomewide association studies (GWAS) and avoid falsepositive results, one can rely on mixed model based tests. When large samples are used, and especially when multiple traits are to be studied in the omics context, this approach becomes computationally unmanageable. Here, we develop new methods for analysis of arbitrary number of traits, and demonstrate that for the analysis of singletrait and multipletrait GWAS, different methods are optimal. We implement these methods in a highperformance computing framework that uses stateoftheart linear algebra kernels, incorporates optimizations, and avoids redundant computations, thus increasing throughput while reducing memory usage and energy consumption. We show that compared to existing libraries, the OmicABEL softwarewhich implements these methodsachieves speedups of up to three orders of magnitude. As a consequence, samples of tens of thousands of individuals as well as samples phenotyped for many thousands of ''omics'' measurements can be analyzed for association with millions of omics features without the need for supercomputers.  Computing Petaflops over Terabytes of Data: The Case of GenomeWide Association StudiesACM Transactions on Mathematical Software (TOMS), Volume 40(4), pp. 27:127:22, June 2014.
@article{FabregatTraver2014:798, author = "Diego FabregatTraver and Paolo Bientinesi", title = "Computing Petaflops over Terabytes of Data: The Case of GenomeWide Association Studies", journal = "ACM Transactions on Mathematical Software (TOMS)", year = 2014, volume = 40, number = 4, pages = "27:127:22", month = jun, publisher = "ACM", url = "http://arxiv.org/pdf/1210.7683v1" }
abstractwebPDFbibtexIn many scientific and engineering applications one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by the GenomeWide Association Studies (GWAS), a widely spread tool in computational biology. GWAS entails the solution of up to trillions ($10^{12}$) of correlated generalized leastsquares problems, posing a daunting challenge: the performance of petaflops ($10^{15}$ floatingpoint operations) over terabytes of data residing on disk. In this paper, we design an efficient algorithm for performing GWAS on multicore architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be distributed among the available cores, to attain high performance and scalability. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.  Towards an Efficient Use of the BLAS Library for Multilinear Tensor ContractionsApplied Mathematics and Computation, Volume 235, pp. 454468, May 2014.
@article{Di_Napoli2014:210, author = "Edoardo {Di Napoli} and Diego FabregatTraver and Gregorio QuintanaOrti and Paolo Bientinesi", title = "Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions", journal = "Applied Mathematics and Computation", year = 2014, volume = 235, pages = "454468", month = may, publisher = "Elsevier", url = "http://arxiv.org/pdf/1307.2100" }
abstractwebPDFbibtexMathematical operators whose transformation rules constitute the building blocks of a multilinear algebra are widely used in physics and engineering applications where they are very often represented as tensors. In the last century, thanks to the advances in tensor calculus, it was possible to uncover new research fields and make remarkable progress in the existing ones, from electromagnetism to the dynamics of fluids and from the mechanics of rigid bodies to quantum mechanics of many atoms. By now, the formal mathematical and geometrical properties of tensors are well defined and understood; conversely, in the context of scientific and highperformance computing, many tensorrelated problems are still open. In this paper, we address the problem of efficiently computing contractions among two tensors of arbitrary dimension by using kernels from the highly optimized BLAS library. In particular, we establish precise conditions to determine if and when GEMM, the kernel for matrix products, can be used. Such conditions take into consideration both the nature of the operation and the storage scheme of the tensors, and induce a classification of the contractions into three groups. For each group, we provide a recipe to guide the users towards the most effective use of BLAS.  Solving Sequences of Generalized LeastSquares Problems on Multithreaded ArchitecturesApplied Mathematics and Computation, Volume 234, pp. 606617, May 2014.
@article{FabregatTraver2014:430, author = "Diego FabregatTraver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Solving Sequences of Generalized LeastSquares Problems on Multithreaded Architectures", journal = "Applied Mathematics and Computation", year = 2014, volume = 234, pages = "606617", month = may, url = "http://arxiv.org/pdf/1210.7325v1" }
abstractwebPDFbibtexGeneralized linear mixedeffects models in the context of genomewide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized leastsquares problems, and the processing of terabytes of data. We present high performance incore and outofcore sharedmemory algorithms for GWAS: By taking advantage of domainspecific knowledge, exploiting multicore parallelism, and handling data efficiently, our algorithms attain unequalled performance. When compared to GenABEL, one of the most widely used libraries for GWAS, on a 12core processor we obtain 50fold speedups. As a consequence, our routines enable genome studies of unprecedented size.  Applicationtailored Linear Algebra Algorithms: A SearchBased ApproachInternational Journal of High Performance Computing Applications (IJHPCA), Volume 27(4), pp. 425  438, November 2013.
@article{FabregatTraver2013:4, author = "Diego FabregatTraver and Paolo Bientinesi", title = "Applicationtailored Linear Algebra Algorithms: A SearchBased Approach", journal = "International Journal of High Performance Computing Applications (IJHPCA)", year = 2013, volume = 27, number = 4, pages = "425  438", month = nov, publisher = "Sage Publications, Inc.", url = "https://arxiv.org/pdf/1211.5904.pdf" }
abstractwebPDFbibtexIn this paper, we tackle the problem of automatically generating algorithms for linear algebra operations by taking advantage of problemspecific knowledge. In most situations, users possess much more information about the problem at hand than what current libraries and computing environments accept; evidence shows that if properly exploited, such information leads to uncommon/unexpected speedups. We introduce a knowledgeaware linear algebra compiler that allows users to input matrix equations together with properties about the operands and the problem itself; for instance, they can specify that the equation is part of a sequence, and how successive instances are related to one another. The compiler exploits all this information to guide the generation of algorithms, to limit the size of the search space, and to avoid redundant computations. We applied the compiler to equations arising as part of sensitivity and genome studies; the algorithms produced exhibit, respectively, 100 and 1000fold speedups.
Peer Reviewed Conference Publications
 Program Generation for SmallScale Linear Algebra ApplicationsProceedings of the International Symposium on Code Generation and Optimization, February 2018.
@inproceedings{Spampinato2018:858, author = "{Daniele } Spampinato and Diego FabregatTraver and Paolo Bientinesi and Markus Pueschel", title = "Program Generation for SmallScale Linear Algebra Applications", year = 2018, address = "Vienna, Austria", month = feb, url = "https://arxiv.org/pdf/1805.04775.pdf" }
abstractwebPDFbibtexWe present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linearalgebrainspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications, higher level operations including linear systems solvers, Cholesky and LU factorizations, as well as loops. The output of SLinGen is performanceoptimized singlesource C code, optionally vectorized with intrinsics. The target of SLinGen are smallscale computations on fixedsize operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSLbased techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuoustime Lyapunov and Sylvester equations. The results show significant speedups compared to straightforward C with icc/clang or a polyhedral optimizer, as well as librarybased and templatebased implementations.  Hybrid CPUGPU generation of the Hamiltonian and Overlap matrices in FLAPW methodsProceedings of the JARAHPC Symposium, Lecture Notes in Computer Science, Volume 10164, pp. 200211, Springer, 2017.
@inproceedings{FabregatTraver2017:4, author = "Diego FabregatTraver and Davor Davidovic and Markus Höhnerbach and Edoardo {Di Napoli}", title = " Hybrid CPUGPU generation of the Hamiltonian and Overlap matrices in FLAPW methods", year = 2017, volume = 10164, series = "Lecture Notes in Computer Science", pages = "200211", publisher = "Springer", url = "https://arxiv.org/pdf/1611.00606v1" }
abstractwebPDFbibtexIn this paper we focus on the integration of highperformance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum J\"ulich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by reengineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable sharedmemory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study in two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to offload parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized sharedmemory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.  On the Performance Prediction of BLASbased Tensor ContractionsHigh Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193212, Springer International Publishing, April 2015.
@inproceedings{Peise2015:380, author = "Elmar Peise and Diego FabregatTraver and Paolo Bientinesi", title = "On the Performance Prediction of BLASbased Tensor Contractions", booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation", year = 2015, editor = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.", volume = 8966, series = "Lecture Notes in Computer Science", pages = "193212", month = apr, publisher = "Springer International Publishing", url = "http://arxiv.org/pdf/1409.8608v1" }
abstractwebPDFbibtexTensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of highperformance kernels for such operations is known to be a challenging task. While for operations on one and twodimensional tensors there exist standardized interfaces and highlyoptimized libraries (BLAS), for higher dimensional tensors neither standards nor highlytuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating highperformance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cacheaware microbenchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the bestperforming algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.  Algorithms for Largescale Whole Genome Association Analysis Proceedings of the 20th European MPI Users' Group Meeting, EuroMPI '13, pp. 229234, ACM, 2013.
@inproceedings{Peise2013:504, author = "Elmar Peise and Diego FabregatTraver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "Algorithms for Largescale Whole Genome Association Analysis ", booktitle = "Proceedings of the 20th European MPI Users' Group Meeting", year = 2013, series = "EuroMPI '13", pages = "229234 ", address = "New York, NY, USA", publisher = "ACM", url = "http://arxiv.org/pdf/1304.2272v1" }
abstractwebPDFbibtexIn order to associate complex traits with genetic polymorphisms, genomewide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds of Gigabytes of genotype data, which can only be kept in secondary storage; 2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: The genotype data is streamed from and to secondary storage using a double buffering technique, while the covariance matrix is kept across the main memory of a distributed memory system. We show that these methods sustain highperformance and allow the analysis of enormous datasets.  A DomainSpecific Compiler for Linear Algebra OperationsHigh Performance Computing for Computational Science  VECPAR 2012, Lecture Notes in Computer Science, Volume 7851, pp. 346361, Springer, 2013.
@inproceedings{FabregatTraver2013:340, author = "Diego FabregatTraver and Paolo Bientinesi", title = "A DomainSpecific Compiler for Linear Algebra Operations", booktitle = "High Performance Computing for Computational Science  VECPAR 2012", year = 2013, editor = "M. Dayde, O. Marques, and K. Nakajima", volume = 7851, series = "Lecture Notes in Computer Science", pages = "346361", address = "Heidelberg", publisher = "Springer", url = "http://arxiv.org/pdf/1205.5975v1" }
abstractwebPDFbibtexWe present a prototypical linear algebra compiler that automatically exploits domainspecific knowledge to generate highperformance algorithms. The input to the compiler is a target equation together with knowledge of both the structure of the problem and the properties of the operands. The output is a variety of highperformance algorithms to solve the target equation. Our approach consists in the decomposition of the input equation into a sequence of librarysupported kernels. Since in general such a decomposition is not unique, our compiler returns not one but a number of algorithms. The potential of the compiler is shown by means of its application to a challenging equation arising within the genomewide association study. As a result, the compiler produces multiple "best" algorithms that outperform the best existing libraries.  Automatic Generation of LoopInvariants for Matrix OperationsComputational Science and its Applications, International Conference, pp. 8292, IEEE Computer Society, 2011.
@inproceedings{FabregatTraver2011:238, author = "Diego FabregatTraver and Paolo Bientinesi", title = "Automatic Generation of LoopInvariants for Matrix Operations", booktitle = "Computational Science and its Applications, International Conference", year = 2011, pages = "8292", address = "Los Alamitos, CA, USA", publisher = "IEEE Computer Society", url = "http://hpac.cs.umu.se/aices/preprint/documents/AICES20110201.pdf" }
abstractwebPDFbibtexIn recent years it has been shown that for many linear algebra operations it is possible to create families of algorithms following a very systematic procedure. We do not refer to the fine tuning of a known algorithm, but to a methodology for the actual generation of both algorithms and routines to solve a given target matrix equation. Although systematic, the methodology relies on complex algebraic manipulations and nonobvious pattern matching, making the procedure challenging to be performed by hand, our goal is the development of a fully automated system that from the sole description of a target equation creates multiple algorithms and routines. We present CL1ck, a symbolic system written in Mathematica, that starts with an equation, decomposes it into multiple equations, and returns a set of loopinvariants for the algorithms  yet to be generated  that will solve the equation. In a successive step each loopinvariant is then mapped to its corresponding algorithm and routine. For a large class of equations, the methodology generates known algorithms as well as many previously unknown ones. Most interestingly, the methodology unifies algorithms traditionally developed in isolation. As an example, the five well known algorithms for the LU factorization are for the first time unified under a common root.  KnowledgeBased Automatic Generation of Partitioned Matrix ExpressionsComputer Algebra in Scientific Computing, Lecture Notes in Computer Science, Volume 6885, pp. 144157, Springer, 2011.
@inproceedings{FabregatTraver2011:54, author = "Diego FabregatTraver and Paolo Bientinesi", title = "KnowledgeBased Automatic Generation of Partitioned Matrix Expressions", booktitle = "Computer Algebra in Scientific Computing", year = 2011, editor = "Gerdt, Vladimir and Koepf, Wolfram and Mayr, Ernst and Vorozhtsov, Evgenii", volume = 6885, series = "Lecture Notes in Computer Science", pages = "144157", address = "Heidelberg", publisher = "Springer", url = "http://hpac.cs.umu.se/aices/preprint/documents/AICES20110103.pdf" }
abstractwebPDFbibtexIn a series of papers it has been shown that for many linear algebra operations it is possible to generate families of algorithms by following a systematic procedure. Although powerful, such a methodology involves complex algebraic manipulation, symbolic computations and pattern matching, making the generation a process challenging to be performed by hand. We aim for a fully automated system that from the sole description of a target operation creates multiple algorithms without any human intervention. Our approach consists of three main stages. The first stage yields the core object for the entire process, the Partitioned Matrix Expression (PME), which establishes how the target problem may be decomposed in terms of simpler subproblems. In the second stage the PME is inspected to identify predicates, the LoopInvariants, to be used to set up the skeleton of a family of proofs of correctness. In the third and last stage the actual algorithms are constructed so that each of them satisfies its corresponding proof of correctness. In this paper we focus on the first stage of the process, the automatic generation of Partitioned Matrix Expressions. In particular, we discuss the steps leading to a PME and the knowledge necessary for a symbolic system to perform such steps. We also introduce Cl1ck, a prototype system written in Mathematica that generates PMEs automatically.  Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.Proceedings of the 8th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2010), AIP Conference Proceedings, Volume 1281, pp. 774777, 2010.
@inproceedings{FabregatTraver2010:304, author = "Diego FabregatTraver and Paolo Bientinesi", title = "Automatic Generation of Partitioned Matrix Expressions for Matrix Operations.", year = 2010, volume = 1281, series = "AIP Conference Proceedings", pages = "774777", url = "http://hpac.cs.umu.se/aices/preprint/documents/AICES20101001.pdf" }
abstractwebPDFbibtexWe target the automatic generation of formally correct algorithms and routines for linear algebra operations. Given the broad variety of architectures and configurations with which scientists deal, there does not exist one algorithmic variant that is suitable for all scenarios. Therefore, we aim to generate a family of algorithmic variants to attain highperformance for a broad set of scenarios. One of the authors has previously demonstrated that automatic derivation of a family of algorithms is possible when the Partitioned Matrix Expression (PME) of the target operation is available. The PME is a recursive definition that states the relations between submatrices in the input and the output operands. In this paper we describe all the steps involved in the automatic derivation of PMEs, thus making progress towards a fully automated system.
Thesis
 KnowledgeBased Automatic Generation of Linear Algebra Algorithms and CodeRWTH Aachen, April 2014.
@phdthesis{FabregatTraver2014:278, author = "Diego FabregatTraver", title = " KnowledgeBased Automatic Generation of Linear Algebra Algorithms and Code", school = "RWTH Aachen", year = 2014, month = apr, url = "http://arxiv.org/abs/1404.3406" }
abstractPDFbibtexThis dissertation focuses on the design and the implementation of domainspecific compilers for linear algebra matrix equations. The development of efficient libraries for such equations, which lie at the heart of most software for scientific computing, is a complex process that requires expertise in a variety of areas, including the application domain, algorithms, numerical analysis and highperformance computing. Moreover, the process involves the collaboration of several people for a considerable amount of time. With our compilers, we aim to relieve the developers from both designing algorithms and writing code, and to generate routines that match or even surpass the performance of those written by human experts.
Other
 Program Generation for Linear Algebra Using Multiple Layers of DSLsProceedings of the DSLDI 2016, Extended abstract, October 2016.bibtex
@misc{Spampinato2016:960, author = "{Daniele } Spampinato and Diego FabregatTraver and Markus Pueschel and Paolo Bientinesi", title = "Program Generation for Linear Algebra Using Multiple Layers of DSLs", month = oct, year = 2016, type = "Extended abstract" }
Technical Report
 HighThroughput GenomeWide Association Analysis for Single and Multiple PhenotypesAachen Institute for Computational Engineering Science, RWTH Aachen, July 2012.
Technical Report AICES2012/071.@techreport{FabregatTraver2012:20, author = "Diego FabregatTraver and {Yurii S.} Aulchenko and Paolo Bientinesi", title = "HighThroughput GenomeWide Association Analysis for Single and Multiple Phenotypes", institution = "Aachen Institute for Computational Engineering Science, RWTH Aachen", year = 2012, month = jul, note = "Technical Report AICES2012/071", url = "http://arxiv.org/abs/1207.2169" }
abstractPDFbibtexThe variance component tests used in genomewide association studies of thousands of individuals become computationally exhaustive when multiple traits are analysed in the context of omics studies. We introduce two highthroughput algorithms  CLAKCHOL and CLAKEIG  for single and multiple phenotype genomewide association studies (GWAS). The algorithms, generated with the help of an expert system, reduce the computational complexity to the point that thousands of traits can be analyzed for association with millions of polymorphisms in a course of days on a standard workstation. By taking advantage of problem specific knowledge, CLAKCHOL and CLAKEIG significantly outperform the current stateoftheart tools in both single and multiple trait analysis.
Talks
 Hybrid CPUGPU generation of the Hamiltonian and Overlap matrices in FLAPW methodsJHPCS'16.
4 October 2016.  Cl1ck + LGen: FLAME for small scale linear algebra BLIS Retreat 2016.
University of Texas at Austin, 19 September 2016.  Cl1ck: A code generator for linear algebra kernelsProgramming Languages Lunch Colloquium.
University of Texas at Austin, 12 September 2016.abstractwebPDFWe present Cl1ck, a code generator for specialized linear algebra kernels. Cl1ck adopts the FLAME methodology for the derivation of formally correct loopbased algorithms, and takes a threestage approach: First, the input operation is transformed into one or more Partitioned Matrix Expressions (PMEs), i.e., a recursive definition of the operation; then, the PMEs are decomposed to identify a family of loop invariants; finally, loopbased algorithms are built around these loop invariants using formal methods techniques. Different backends enable then the translation of the algorithms into Matlab and optimized C code.  OmicABEL: Story of a successful interdisciplinary collaborationPASC Conference 14.
ETH Zürich, Zürich, Switzerland, June 2014.  KnowledgeBased Automatic Generation of Algorithms and CodePh.D. Defense, Aachen, Germany, December 2013.
 Automating the Generation of Algorithms for Generalized LeastSquares ProblemsThe 6th European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2012).
Vienna, Austria, September 2012.  A DomainSpecific Compiler for Linear Algebra OperationsThe 7th Intl Workshop on Automatic Performance Tuning (iWAPT), 10th Intl Meeting on HighPerformance Computing for Computational Science (VECPAR 2012).
Kobe, Japan, July 2012.  Highthroughput Algorithms for GenomeWide Association StudiesThe 8th International Conference on Bioinformatics of Genome Regulation and Structure\Systems Biology (BGRS\SB 2012).
Novosibirsk, Russia, June 2012.  Sequences of Generalized LeastSquares for GenomeWide Association StudiesThe 83rd Annual Meeting of the International Association of Applied Mathematics and Mechanics (GAMM 2012).
Darmstadt, Germany, March 2012.  KnowledgeBased Automatic Generation of Partitioned Matrix ExpressionsThe 13th International Workshop on Computer Algebra in Scientific Computing (CASC 2011).
Kassel, Germany, September 2011.  Automatic Generation of LoopInvariants for Matrix OperationsWorkshop on Computer Algebra Systems and Their Applications (CASA), 11th International Conference on Computational Science and Its Applications (ICCSA 2011).
Santander, Spain, June 2011.  Automatic Generation of Partitioned Matrix Expressions for Matrix OperationsWorkshop on Automated computing, 8th International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2010)..
Rhodes, Greece, September 2010.  Generation of Linear Algebra Algorithms for Automatic DifferentiationWorkshop on Autotuning, 6th Parallel Matrix Algorithms and Applications (PMAA 2010).
Basel, Switzerland, June 2010.
Posters

Mechanical Generation of Algorithms for Automatic Differentiation. [PDF]
Conference on Numerical Linear Algebra: Perturbation, Performance and Portability in celebration of G.W. (Pete) Stewart's 70th Birthday.
July 2010, Austin, TX, USA.