Benchmarking Thread Block Cluster
Citation Link: https://doi.org/10.15480/882.13279
Publication Type
Conference Paper
Date Issued
2024
Language
English
TORE-DOI
Citation
28th Annual IEEE High Performance Extreme Computing Conference, HPEC 2024
Contribution to Conference
Peer Reviewed
true
Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to its CUDA programming model: the thread block cluster (TBC). This feature groups thread blocks and allows direct communication and synchronization between them. To support this, a dedicated SM-to-SM network was integrated, connecting streaming multiprocessors (SMs) for efficient inter-block communication. This paper examines the performance characteristics of this new feature, in particular the latencies developers can anticipate when utilizing the direct communication channel provided by TBCs. We present an analysis of the SM-to-SM network behavior, which is crucial for developing accurate analytical and cycle-accurate simulation models. Our study includes a comprehensive evaluation of the impact of TBCs on application performance, highlighting scenarios where this feature can lead to significant improvements. For instance, applications where a data-producing thread block writes data directly into the shared memory of the consuming thread block can be up to 2.3× faster than using global memory for data transfer. Additionally, applications constrained by shared memory can achieve up to a 2.1× speedup by employing TBCs. Our findings also reveal that utilizing large cluster dimensions can result in an execution time overhead exceeding 20%. By exploring the intricacies of the Hopper architecture and its new TBC feature, this paper equips developers with the knowledge needed to harness the full potential of modern GPUs and assists researchers in developing accurate analytical and cycle-accurate simulation models.
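The producer/consumer pattern mentioned in the abstract, where one thread block writes directly into the shared memory of another block in the same cluster, might look roughly like the sketch below. It is an illustration only and not the paper's benchmark code: the kernel name `producer_consumer`, the buffer `buf`, the constant `kThreads`, and the cluster size of 2 are assumptions chosen for the example. It uses the cooperative groups cluster API (`this_cluster`, `map_shared_rank`, `sync`) and is expected to require a device of compute capability 9.0 (for example, compiled with `nvcc -arch=sm_90`).

```cuda
#include <cooperative_groups.h>
#include <cstdio>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int kThreads = 128;  // hypothetical block size for this sketch

// Two thread blocks per cluster, fixed at compile time.
// Block rank 0 plays the producer, block rank 1 the consumer.
__global__ void __cluster_dims__(2, 1, 1) producer_consumer(int *out)
{
    __shared__ int buf[kThreads];
    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank = cluster.block_rank();

    // Ensure both blocks of the cluster have started before any remote access.
    cluster.sync();

    if (rank == 0) {
        // Obtain a pointer to the consumer block's copy of buf and write to it
        // directly, without staging the data in global memory.
        int *remote = cluster.map_shared_rank(buf, 1);
        remote[threadIdx.x] = 2 * static_cast<int>(threadIdx.x);
    }

    // Make the remote writes visible to the consumer and keep both blocks
    // (and their shared memory) alive until the exchange is complete.
    cluster.sync();

    if (rank == 1) {
        out[threadIdx.x] = buf[threadIdx.x];
    }
}

int main()
{
    int h_out[kThreads] = {0};
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(h_out));

    // The grid size must be a multiple of the cluster size (2 here).
    producer_consumer<<<2, kThreads>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        // Expected on GPUs without thread block cluster support.
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Check on the host that the consumer saw every value the producer wrote.
    bool ok = true;
    for (int i = 0; i < kThreads; ++i) {
        ok = ok && (h_out[i] == 2 * i);
    }
    std::printf("%s\n", ok ? "match" : "mismatch");

    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

The global-memory variant that the abstract compares against would have the producer store to a device buffer and the consumer load from it after a grid-level synchronization or a second kernel launch. The sketch above does not measure latency or reproduce any of the reported speedups.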
Subjects
CUDA
Benchmarking
Hopper GPU
Thread block cluster
DDC Class
004: Computer Sciences
006.3: Artificial Intelligence
519: Applied Mathematics, Probabilities
Publication version
draft
Name
preprint_HPEC2024.pdf
Type
Main Article
Size
437.72 KB
Format
Adobe PDF