Lühnen, Tim Julius; Marschner, Tobias; Lal, Sohan
2024-09-16
28th Annual IEEE High Performance Extreme Computing Conference, HPEC 2024
https://hdl.handle.net/11420/49014

Abstract: Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to the CUDA programming model: the thread block cluster (TBC). This feature enables the grouping of thread blocks and allows direct communication and synchronization between them. To support this, a dedicated SM-to-SM network connecting streaming multiprocessors (SMs) was integrated for efficient inter-block communication. This paper examines the performance characteristics of this new feature, specifically the latencies developers can anticipate when utilizing the direct communication channel provided by TBCs. We present an analysis of the SM-to-SM network behavior, which is crucial for developing accurate analytical and cycle-accurate simulation models. Our study includes a comprehensive evaluation of the impact of TBCs on application performance, highlighting scenarios where this feature can lead to significant improvements. For instance, applications in which a data-producing thread block writes data directly into the shared memory of the consuming thread block can be up to 2.3× faster than using global memory for the data transfer. Additionally, applications constrained by shared memory capacity can achieve up to a 2.1× speedup by employing TBCs. Our findings also reveal that large cluster dimensions can incur an execution-time overhead exceeding 20%.
By exploring the intricacies of the Hopper architecture and its new TBC feature, this paper equips developers with the knowledge needed to harness the full potential of modern GPUs and assists researchers in developing accurate analytical and cycle-accurate simulation models.

Language: English
License: http://rightsstatements.org/vocab/InC/1.0/
Keywords: CUDA; Benchmarking; Hopper GPU; Thread block cluster
Classification: Computer Science, Information and General Works::004: Computer Sciences; Computer Science, Information and General Works::006: Special computer methods::006.3: Artificial Intelligence; Natural Sciences and Mathematics::519: Applied Mathematics, Probabilities
Title: Benchmarking Thread Block Cluster
Type: Conference Paper
DOI: 10.15480/882.13279
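The producer–consumer pattern described in the abstract (one thread block writing directly into the shared memory of another block in the same cluster, over the SM-to-SM network) can be sketched with the CUDA 12 cooperative-groups cluster API. This is a minimal illustrative sketch, not code from the paper; the kernel name, buffer size, and rank assignments are our own assumptions.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative sketch (not from the paper): a cluster of two thread blocks
// in which rank 0 produces data directly into rank 1's shared memory via
// distributed shared memory, avoiding a round trip through global memory.
__global__ void __cluster_dims__(2, 1, 1) producer_consumer(float *out) {
    __shared__ float buf[256];                    // per-block shared memory
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();     // rank of this block in the cluster

    if (rank == 0) {
        // Map rank 1's copy of `buf` into this block's address space,
        // then write ("produce") into it over the SM-to-SM network.
        float *remote = cluster.map_shared_rank(buf, 1);
        remote[threadIdx.x] = static_cast<float>(threadIdx.x);
    }

    cluster.sync();  // all blocks in the cluster wait until the data is written

    if (rank == 1) {
        // Consume the data from local shared memory.
        out[threadIdx.x] = buf[threadIdx.x];
    }
}
```

Launching such a kernel requires a Hopper-class (compute capability 9.0) GPU; the cluster shape can alternatively be set at launch time with `cudaLaunchKernelEx` and a `cudaLaunchAttributeClusterDimension` attribute instead of the compile-time `__cluster_dims__` annotation.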