TUHH Open Research
Design of a high-performance tensor–matrix multiplication with BLAS

Citation Link: https://doi.org/10.15480/882.16445
Publication Type
Journal Article
Date Issued
2025-03-22
Language
English
Author(s)
Başsoy, Cem Savaş  
Eingebettete Systeme E-13  
TORE-DOI
10.15480/882.16445
TORE-URI
https://hdl.handle.net/11420/60738
Journal
Journal of Computational Science  
Volume
87
Article Number
102568
Citation
Journal of Computational Science 87: 102568 (2025)
Publisher DOI
10.1016/j.jocs.2025.102568
Scopus ID
2-s2.0-105000555227
Publisher
Elsevier
Abstract
The tensor–matrix multiplication (TTM) is a basic tensor operation required by various tensor methods such as the HOSVD. This paper presents flexible high-performance algorithms that compute the tensor–matrix product according to the Loops-over-GEMM (LOG) approach. The proposed algorithms can process dense tensors with any linear tensor layout, arbitrary tensor order, and arbitrary dimensions, all of which can be runtime-variable. The paper discusses two slicing methods with orthogonal parallelization strategies and proposes four algorithms that call BLAS with subtensors or tensor slices. It also provides a simple heuristic which selects one of the four proposed algorithms at runtime. All algorithms have been evaluated on a large set of tensors with various tensor shapes and linear tensor layouts. For large tensor slices, our best-performing algorithm achieves a median performance of 2.47 TFLOPS on an Intel Xeon Gold 5318Y and 2.93 TFLOPS on an AMD EPYC 9354. Furthermore, it outperforms the batched GEMM implementation of Intel MKL by a factor of 2.57 with large tensor slices. Our runtime tests show that our best-performing algorithm is, on average, at least 6.21% and up to 334.31% faster than frameworks implementing state-of-the-art approaches, including actively developed libraries such as Libtorch and Eigen. For the majority of tensor shapes, it is on par with TBLIS, which uses optimized kernels for the TTM computation. Our algorithm performs better than all other competing implementations for the majority of real-world tensors from SDRBench, reaching a speedup of 2x or more for some tensor instances. This work is an extended version of "Fast and Layout-Oblivious Tensor–Matrix Multiplication with BLAS" (Başsoy 2024).
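To make the Loops-over-GEMM idea from the abstract concrete, the following is a minimal NumPy sketch of a mode-q tensor-times-matrix product: it loops over all modes except the contraction mode q and one additional slice mode, and hands each two-dimensional tensor slice to a GEMM (np.matmul, which dispatches to the linked BLAS backend). This is an assumed illustration of the general LOG scheme under a row-major NumPy layout, not the paper's C++/BLAS implementation; the function name ttm_log, the zero-based mode indexing, and the slice-mode choice are hypothetical.

import numpy as np
from itertools import product

def ttm_log(A, B, q):
    """Mode-q tensor-times-matrix product C = A x_q B in a loops-over-GEMM style.

    A : dense tensor of arbitrary order p (ndarray)
    B : matrix of shape (m, n_q), contracted with mode q of A
    q : contraction mode, counted from 0 here (the paper counts modes from 1)
    """
    p = A.ndim
    m, nq = B.shape
    assert A.shape[q] == nq, "mode-q dimension of A must match the columns of B"
    r = 0 if q != 0 else 1                       # second slice mode kept inside each GEMM
    out_shape = list(A.shape)
    out_shape[q] = m
    C = np.empty(out_shape, dtype=np.result_type(A, B))
    free = [k for k in range(p) if k not in (q, r)]
    # Loop over every index combination of the remaining p-2 modes and
    # multiply each 2-D slice of A with B via a single GEMM call.
    for idx in product(*(range(A.shape[k]) for k in free)):
        sel = [slice(None)] * p
        for k, i in zip(free, idx):
            sel[k] = i
        slice_A = A[tuple(sel)]                  # 2-D slice over modes q and r
        if q < r:                                # slice layout is (n_q, n_r)
            C[tuple(sel)] = B @ slice_A
        else:                                    # slice layout is (n_r, n_q)
            C[tuple(sel)] = slice_A @ B.T
    return C

# Quick check against np.tensordot for a small order-4 tensor.
A = np.random.rand(4, 5, 6, 7)
B = np.random.rand(3, 6)
q = 2
ref = np.moveaxis(np.tensordot(B, A, axes=([1], [q])), 0, q)
assert np.allclose(ttm_log(A, B, q), ref)

The sketch trades performance for clarity: the paper's algorithms instead choose between subtensor- and slice-based GEMM calls, parallelize the outer loops, and pick a variant at runtime via a heuristic.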
Subjects
High-performance computing
Tensor contraction
Tensor methods
Tensor-times-matrix multiplication
DDC Class
006: Special computer methods
518: Numerical Analysis
Funding(s)
Projekt DEAL  
License
https://creativecommons.org/licenses/by/4.0/
Publication version
publishedVersion
Name
1-s2.0-S1877750325000456-main.pdf
Size
1.47 MB
Format
Adobe PDF