High-Performance Tensor Contractions
Publications and Talks
Peer-Reviewed Conference Publication
- Elmar Peise, Diego Fabregat-Traver, and Paolo Bientinesi. On the Performance Prediction of BLAS-based Tensor Contractions. High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Lecture Notes in Computer Science, Volume 8966, pp. 193-212, Springer International Publishing, April 2015.
@inproceedings{Peise2015:380,
  author    = "Elmar Peise and Diego Fabregat-Traver and Paolo Bientinesi",
  title     = "On the Performance Prediction of BLAS-based Tensor Contractions",
  booktitle = "High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation",
  year      = 2015,
  editor    = "Jarvis, Stephen A. and Wright, Steven A. and Hammond, Simon D.",
  volume    = 8966,
  series    = "Lecture Notes in Computer Science",
  pages     = "193-212",
  month     = apr,
  publisher = "Springer International Publishing",
  doi       = "10.1007/978-3-319-17248-4_10",
  url       = "http://arxiv.org/pdf/1409.8608v1"
}
Abstract: Tensor operations are surging as the computational building blocks for a variety of scientific simulations, and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher-dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
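To make the decomposition idea from the abstract concrete, the sketch below computes the contraction C[a,b,c] = sum_k A[a,k] * B[k,b,c] in two of the many possible ways, each mapping to BLAS GEMM calls. The specific contraction, operand sizes, and NumPy-based code are illustrative assumptions, not code from the paper:

```python
import numpy as np

# Hypothetical example contraction: C[a,b,c] = sum_k A[a,k] * B[k,b,c].
# Two of the many possible BLAS-based decompositions are sketched below.

def contract_single_gemm(A, B):
    """Decomposition 1: flatten B into a (k, b*c) matrix and use one large GEMM."""
    k, b, c = B.shape
    C_flat = A @ B.reshape(k, b * c)   # one GEMM on (a,k) x (k, b*c) operands
    return C_flat.reshape(A.shape[0], b, c)

def contract_gemm_loop(A, B):
    """Decomposition 2: one smaller GEMM per slice B[:, :, i]."""
    a, k = A.shape
    _, b, c = B.shape
    C = np.empty((a, b, c))
    for i in range(c):                 # c GEMMs on (a,k) x (k,b) operands
        C[:, :, i] = A @ B[:, :, i]
    return C

A = np.random.rand(40, 50)
B = np.random.rand(50, 60, 70)
assert np.allclose(contract_single_gemm(A, B), contract_gemm_loop(A, B))
```

Both variants compute the same tensor, but they invoke GEMM with very different operand shapes and counts, which is exactly why their performance can differ and why the paper sets out to rank such alternatives without running them all.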
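The prediction methodology can be sketched in a similarly reduced form: benchmark the underlying GEMM kernel once per operand shape, then rank candidate algorithms by the sum of their predicted kernel times. The candidate list and timing loop below are hypothetical simplifications; the paper's micro-benchmarks are cache-aware and account for where operands reside in the memory hierarchy, which this toy version ignores:

```python
import time
import numpy as np

def benchmark_gemm(m, n, k, reps=10):
    """Measure the average time of one (m,k) x (k,n) GEMM."""
    A, B = np.random.rand(m, k), np.random.rand(k, n)
    start = time.perf_counter()
    for _ in range(reps):
        A @ B
    return (time.perf_counter() - start) / reps

# Each candidate algorithm is described by its GEMM invocations
# as (number of calls, m, n, k) tuples; sizes match the sketch above.
a, b, c, k = 40, 60, 70, 50
candidates = {
    "single large GEMM": [(1, a, b * c, k)],
    "GEMM per c-slice":  [(c, a, b, k)],
}

# Predicted runtime of each algorithm = sum of its benchmarked kernel times.
predicted = {
    name: sum(calls * benchmark_gemm(m, n, kk) for calls, m, n, kk in kernels)
    for name, kernels in candidates.items()
}
print(min(predicted, key=predicted.get), "is predicted to be fastest")
```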
Talks
- On the Performance Prediction of BLAS-based Tensor Contractions. 5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS14).
SC14, New Orleans, LA, USA, 16 November 2014.
- Estimating the Efficiency of BLAS-based Tensor Contractions. Annual Report 2.
AICES, RWTH Aachen, 6 November 2014.