Authors: Al-Zoubi, Ahmad; Martino, Gianluca; Bahnsen, Fin Hendrik; Zhu, Jun; Schlarb, Holger; Fey, Görschwin
Dates: 2022-11-07; 2022-11-07; 2022-09
Conference: 35th IEEE International System-on-Chip Conference (SOCC 2022)
URI: http://hdl.handle.net/11420/13970

Abstract: Developers have proposed various hardware accelerators to improve CNN inference performance on embedded platforms. Recently, Xilinx announced its first 7-nm FPGA accelerator, the Versal ACAP, which delivers a high-performance, heterogeneous computing platform adaptable to application requirements. However, early studies were concerned with the most common deep learning architectures for CNNs (e.g. VGG, ResNet, and Inception), all fully supported by Xilinx Vitis-AI, so the implementation and performance analysis of customized CNN architectures on the Versal ACAP is yet to be explored. In this study, we implement one of the CNN architectures considered at the European XFEL and compare its performance to a state-of-the-art GPU and another FPGA generation. In addition, this study evaluates the validity of using quantization methods for critical regression applications and presents a complete analysis of the results built upon the device time traces, providing recommendations for configuring the runtime parameters. The experimental results confirm the superior performance of the Versal ACAP in terms of latency and throughput. When all neural network layers were supported by the ACAP processing unit, it achieved 17x better throughput and 18x better latency than the GPU. In addition, when quantized using the fine-tuning method, the CNN model shows improved accuracy compared to the floating-point model, with a 6% reduction in loss.

Language: en
Keywords: CNN; FPGA; Heterogeneous; Quantization; Versal ACAP
MLE@TUHH
Title: CNN Implementation and Analysis on Xilinx Versal ACAP at European XFEL
Type: Conference Paper
DOI: 10.1109/SOCC56010.2022.9908101