Title: Accelerating GPGPU simulation by strategically parallelizing the compute bottleneck
Authors: Sachs, Jakob; Lühnen, Tim Julius; Lal, Sohan
Date: 2026-01-27
Venue: PARMA-DITAM 2026
Type: Conference Paper
Handle: https://hdl.handle.net/11420/61087
DOI: https://doi.org/10.15480/882.16573
License: https://creativecommons.org/licenses/by/4.0/
Language: en
Keywords: GPGPU; CUDA; Simulation; Computer Architecture; GPGPU-Sim; Thread Pool
Classification: 004: Computer Sciences; 621.3: Electrical Engineering, Electronic Engineering; 005: Computer Programming, Programs, Data and Security

Abstract: Cycle-accurate GPGPU simulators like GPGPU-Sim provide invaluable insights for hardware architecture research but suffer from extremely long runtimes, hindering research productivity. This paper addresses this critical bottleneck by proposing a strategy to accelerate GPGPU-Sim. We first perform a holistic profiling analysis across diverse GPGPU benchmarks to identify the primary performance bottleneck, pinpointing the SIMT-Core cluster execution within the CORE-clock cycle. Based on this, we implement a parallelization scheme that strategically targets this hotspot, utilizing a thread pool to manage concurrent execution of SIMT-Core clusters. Our approach prioritizes minimal modifications to the existing GPGPU-Sim codebase to ensure long-term maintainability. Evaluation on a simulated NVIDIA H100 model demonstrates an average simulation wall-time speedup of 3.58x with 8 worker threads, and a maximum speedup of 4.38x, while incurring a maximum cycle count error of 3.22%; several benchmarks exhibit no error at all.
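The abstract describes a thread pool that lets the per-cycle work of the SIMT-Core clusters run concurrently while the rest of the simulator stays sequential. The paper's actual changes to GPGPU-Sim are not reproduced here; the following is only a minimal C++ sketch of that idea, where `ThreadPool`, `Cluster::cycle()`, and the fixed cluster and worker counts are illustrative assumptions, not the authors' implementation.

```cpp
// Sketch only: a persistent worker pool that runs each simulated SIMT-Core
// cluster's per-cycle work in parallel, with a barrier before the next
// CORE-clock cycle. All names are hypothetical stand-ins.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    // Enqueue one task; returns immediately.
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); ++pending_; }
        cv_.notify_one();
    }
    // Block until all submitted tasks have finished (per-cycle barrier).
    void wait_all() {
        std::unique_lock<std::mutex> lk(m_);
        done_cv_.wait(lk, [this] { return pending_ == 0; });
    }
private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            {
                std::lock_guard<std::mutex> lk(m_);
                if (--pending_ == 0) done_cv_.notify_all();
            }
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_, done_cv_;
    unsigned pending_ = 0;
    bool stop_ = false;
};

// Hypothetical stand-in for one SIMT-Core cluster's per-cycle work.
struct Cluster {
    void cycle() { /* advance pipelines, caches, ejection/injection ports */ }
};

int main() {
    std::vector<Cluster> clusters(8);  // one object per simulated SIMT-Core cluster
    ThreadPool pool(8);                // 8 worker threads, matching the paper's evaluation

    for (int core_clock = 0; core_clock < 1000; ++core_clock) {
        for (auto& c : clusters)
            pool.submit([&c] { c.cycle(); });  // clusters advance concurrently
        pool.wait_all();                       // synchronize before the next cycle
    }
}
```

Keeping the worker threads alive across cycles, rather than spawning threads per cycle, is what keeps the dispatch overhead small enough for a per-cycle fork/join pattern; any shared state touched inside `cycle()` would still need its own synchronization in a real integration.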