TUHH Open Research
Benchmarking Thread Block Cluster

Citation Link: https://doi.org/10.15480/882.13279
Publication Type
Conference Paper
Date Issued
2024
Language
English
Author(s)
Lühnen, Tim Julius (Massively Parallel Systems E-EXK5)
Marschner, Tobias (Massively Parallel Systems E-EXK5)
Lal, Sohan (Massively Parallel Systems E-EXK5)
TORE-DOI
10.15480/882.13279
TORE-URI
https://hdl.handle.net/11420/49014
Citation
28th Annual IEEE High Performance Extreme Computing Conference, HPEC 2024
Contribution to Conference
28th Annual IEEE High Performance Extreme Computing Conference, HPEC 2024  
Peer Reviewed
true
Abstract
Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to their CUDA programming model: the thread block cluster (TBC). This feature enables the grouping of thread blocks, facilitating direct communication and synchronization between them. To support this, a dedicated SM-to-SM network was integrated, connecting streaming multiprocessors (SMs) to facilitate efficient inter-block communication. This paper delves into the performance characteristics of this new feature, specifically examining the latencies developers can anticipate when utilizing the direct communication channel provided by TBCs. We present an analysis of the SM-to-SM network behavior, which is crucial for developing accurate analytical and cycle-accurate simulation models. Our study includes a comprehensive evaluation of the impact of TBCs on application performance, highlighting scenarios where this feature can lead to significant improvements. For instance, applications where a data-producing thread block writes data directly into the shared memory of the consuming thread block can be up to 2.3× faster than using global memory for data transfer. Additionally, applications constrained by shared memory can achieve up to a 2.1× speedup by employing TBCs. Our findings also reveal that utilizing large cluster dimensions can result in an execution time overhead exceeding 20%. By exploring the intricacies of the Hopper architecture and its new TBC feature, this paper equips developers with the knowledge needed to harness the full potential of modern GPUs and assists researchers in developing accurate analytical and cycle-accurate simulation models.
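The producer-consumer pattern highlighted in the abstract can be illustrated with the CUDA cooperative-groups cluster API. The sketch below is not taken from the paper: the kernel name, cluster shape (2 blocks), buffer size (256 ints), and thread count are illustrative choices, and it assumes a Hopper GPU (compute capability 9.0) with CUDA 12 or newer.

```cuda
// Minimal sketch (not from the paper): a producer thread block writes directly
// into the consumer block's shared memory using a thread block cluster.
// Assumes Hopper (sm_90) and CUDA 12+; names and sizes are illustrative.
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Compile-time cluster of 2 thread blocks along x.
__global__ void __cluster_dims__(2, 1, 1) producer_consumer(int *out) {
    cg::cluster_group cluster = cg::this_cluster();
    __shared__ int buf[256];

    unsigned int rank = cluster.block_rank();   // 0 = producer, 1 = consumer
    if (rank == 0) {
        // Map the consumer's shared-memory buffer (distributed shared memory)
        // and write into it over the SM-to-SM network.
        int *consumer_buf = cluster.map_shared_rank(buf, 1);
        consumer_buf[threadIdx.x] = 2 * threadIdx.x;
    }

    // Cluster-wide barrier: remote shared-memory writes are visible afterwards.
    cluster.sync();

    if (rank == 1) {
        // Consumer reads the data the producer placed in its shared memory.
        out[threadIdx.x] = buf[threadIdx.x];
    }
}

int main() {
    int *d_out = nullptr;
    cudaMalloc(&d_out, 256 * sizeof(int));
    // Grid of 2 blocks (= one cluster), 256 threads per block.
    producer_consumer<<<2, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Compiling requires targeting Hopper (e.g. nvcc -arch=sm_90); the __cluster_dims__ attribute and map_shared_rank are unavailable on earlier architectures. The paper's up-to-2.3× producer-consumer result refers to replacing a global-memory round trip with this kind of direct write into the consumer's shared memory.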
Subjects
CUDA
Benchmarking
Hopper GPU
Thread block cluster
DDC Class
004: Computer Sciences
006.3: Artificial Intelligence
519: Applied Mathematics, Probabilities
Publication version
draft
License
http://rightsstatements.org/vocab/InC/1.0/
Name
preprint_HPEC2024.pdf
Type
Main Article
Size
437.72 KB
Format
Adobe PDF
