TUHH Open Research
Benchmarking Thread Block Cluster

Citation Link: https://doi.org/10.15480/882.13279
Publication Type
Conference Paper
Date Issued
2024
Language
English
Author(s)
Lühnen, Tim Julius (Massively Parallel Systems E-EXK5)
Marschner, Tobias (Massively Parallel Systems E-EXK5)
Lal, Sohan (Massively Parallel Systems E-EXK5)
TORE-DOI
10.15480/882.13279
TORE-URI
https://hdl.handle.net/11420/49014
Citation
28th Annual IEEE High Performance Extreme Computing Conference, HPEC 2024
Contribution to Conference
28th Annual IEEE High Performance Extreme Computing Conference, HPEC 2024  
Peer Reviewed
true
Abstract
Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to their CUDA programming model: the thread block cluster (TBC). This feature enables the grouping of thread blocks, facilitating direct communication and synchronization between them. To support this, a dedicated SM-to-SM network was integrated, connecting streaming multiprocessors (SMs) to facilitate efficient inter-block communication. This paper delves into the performance characteristics of this new feature, specifically examining the latencies developers can anticipate when utilizing the direct communication channel provided by TBCs. We present an analysis of the SM-to-SM network behavior, which is crucial for developing accurate analytical and cycle-accurate simulation models. Our study includes a comprehensive evaluation of the impact of TBCs on application performance, highlighting scenarios where this feature can lead to significant improvements. For instance, applications where a data-producing thread block writes data directly into the shared memory of the consuming thread block can be up to 2.3× faster than using global memory for data transfer. Additionally, applications constrained by shared memory can achieve up to a 2.1× speedup by employing TBCs. Our findings also reveal that utilizing large cluster dimensions can result in an execution time overhead exceeding 20%. By exploring the intricacies of the Hopper architecture and its new TBC feature, this paper equips developers with the knowledge needed to harness the full potential of modern GPUs and assists researchers in developing accurate analytical and cycle-accurate simulation models.
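The producer-consumer pattern highlighted in the abstract can be illustrated with the CUDA cooperative-groups cluster API. The sketch below is not taken from the paper: the kernel name, cluster shape (2 blocks), buffer size (256 ints), and thread count are illustrative choices, and it assumes a Hopper GPU (compute capability 9.0) with CUDA 12 or newer.

```cuda
// Minimal sketch (not from the paper): a producer thread block writes directly
// into the consumer block's shared memory using a thread block cluster.
// Assumes Hopper (sm_90) and CUDA 12+; names and sizes are illustrative.
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Compile-time cluster of 2 thread blocks along x.
__global__ void __cluster_dims__(2, 1, 1) producer_consumer(int *out) {
    cg::cluster_group cluster = cg::this_cluster();
    __shared__ int buf[256];

    unsigned int rank = cluster.block_rank();   // 0 = producer, 1 = consumer
    if (rank == 0) {
        // Map the consumer's shared-memory buffer (distributed shared memory)
        // and write into it over the SM-to-SM network.
        int *consumer_buf = cluster.map_shared_rank(buf, 1);
        consumer_buf[threadIdx.x] = 2 * threadIdx.x;
    }

    // Cluster-wide barrier: remote shared-memory writes are visible afterwards.
    cluster.sync();

    if (rank == 1) {
        // Consumer reads the data the producer placed in its shared memory.
        out[threadIdx.x] = buf[threadIdx.x];
    }
}

int main() {
    int *d_out = nullptr;
    cudaMalloc(&d_out, 256 * sizeof(int));
    // Grid of 2 blocks (= one cluster), 256 threads per block.
    producer_consumer<<<2, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Compiling requires targeting Hopper (e.g. nvcc -arch=sm_90); the __cluster_dims__ attribute and map_shared_rank are unavailable on earlier architectures. The paper's up-to-2.3× producer-consumer result refers to replacing a global-memory round trip with this kind of direct write into the consumer's shared memory.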
Subjects
CUDA
Benchmarking
Hopper GPU
Thread block cluster
DDC Class
004: Computer Sciences
006.3: Artificial Intelligence
519: Applied Mathematics, Probabilities
Publication version
draft
License
http://rightsstatements.org/vocab/InC/1.0/
Name
preprint_HPEC2024.pdf
Type
Main Article
Size
437.72 KB
Format
Adobe PDF
