Accelerating GPGPU simulation by strategically parallelizing the compute bottleneck

Cycle-accurate GPGPU simulators such as GPGPU-Sim provide invaluable insights for hardware architecture research, but their extremely long runtimes hinder research productivity. This paper addresses this bottleneck by proposing a strategy to accelerate GPGPU-Sim. We first profile a diverse set of GPGPU benchmarks to identify the primary performance bottleneck, which turns out to be the execution of the SIMT-Core clusters within the CORE-clock cycle. Based on this finding, we implement a parallelization scheme that targets this hotspot and uses a thread pool to manage the concurrent execution of SIMT-Core clusters. Our approach keeps modifications to the existing GPGPU-Sim codebase minimal to ensure long-term maintainability. Evaluation on a simulated NVIDIA H100 model shows an average simulation wall-time speedup of 3.58x with 8 worker threads and a maximum speedup of 4.38x. The maximum cycle count error is 3.22%, and some benchmarks exhibit no error at all.
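
The following is a minimal sketch of the general idea described in the abstract, not the paper's implementation and not GPGPU-Sim code. It assumes each cluster's per-cycle work can be submitted as an independent task and that all tasks must finish before the next core-clock cycle begins. The names ClusterStub, ThreadPool, wait_idle and simulate are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a SIMT-Core cluster (not the GPGPU-Sim class).
struct ClusterStub {
    unsigned long long cycles = 0;
    void core_cycle() { ++cycles; }  // touches only cluster-local state
};

// Minimal fixed-size thread pool with a "wait until all tasks finished" barrier.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n_workers) {
        assert(n_workers > 0);  // with zero workers, wait_idle() would never return
        for (std::size_t i = 0; i < n_workers; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        work_cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        work_cv_.notify_one();
    }
    // Blocks until every submitted task has completed.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mu_);
        idle_cv_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                work_cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            {
                std::lock_guard<std::mutex> lock(mu_);
                if (--pending_ == 0) idle_cv_.notify_all();
            }
        }
    }
    std::mutex mu_;
    std::condition_variable work_cv_, idle_cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
};

// Advances all clusters for n_cycles core-clock cycles, one task per cluster
// per cycle, and returns the total number of cluster cycles executed.
unsigned long long simulate(std::size_t n_clusters, std::size_t n_workers,
                            unsigned n_cycles) {
    std::vector<ClusterStub> clusters(n_clusters);
    ThreadPool pool(n_workers);
    for (unsigned c = 0; c < n_cycles; ++c) {
        for (auto& cl : clusters)
            pool.submit([&cl] { cl.core_cycle(); });
        pool.wait_idle();  // barrier: the next cycle starts only after all clusters finish
    }
    unsigned long long total = 0;
    for (const auto& cl : clusters) total += cl.cycles;
    return total;
}
```

Calling simulate(4, 2, 100), for example, advances four stub clusters for 100 cycles on two worker threads. The stub clusters share no state, so the sketch does not model the shared structures of a real simulator and says nothing about the cycle count error reported above.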

Subjects

GPGPU

CUDA

Simulation

Computer Architecture

GPGPU-Sim

Thread Pool

DDC Class

004: Computer Sciences

621.3: Electrical Engineering, Electronic Engineering

005: Computer Programming, Programs, Data and Security

License

https://creativecommons.org/licenses/by/4.0/

Publication version

submittedVersion

Name

Accelerating-GPGPU-Simulation.pdf

Size

870.83 KB

Format

Adobe PDF
