Title: Latency-optimized hardware acceleration of multilayer perceptron inference
Authors: Ahmad al-Zoubi; Benedikt Schaible; Gianluca Martino; Görschwin Fey
Published in: 26th Euromicro Conference on Digital System Design (DSD), pp. 235-241 (2023)
ISBN: 979-8-3503-4419-6
DOI: 10.1109/DSD60849.2023.00042
Handle: https://hdl.handle.net/11420/47298
Date available: 2024-04-26
Language: en
Type: Conference Paper
Keywords: FPGA; MLP; Non-linear activation function; Parallel; Segmentation; MLE@TUHH; Computer Sciences; Engineering and Applied Operations

Abstract: Decreasing the inference latency of neural networks is crucial in situations where real-time responses are necessary. We propose a new neuron architecture for parallel computation, targeting MLP implementation on an FPGA. The parallelism in the proposed architecture is exposed through the segmentation of non-linear activation functions into sets of linear segments, yielding highly accurate approximations of the original functions. The implementation combines several further optimization techniques, such as fixed-point arithmetic, pipelining, array partitioning, and loop unrolling. To validate the proposed architecture using the Xilinx Vitis HLS toolchain, four MLPs with a mix of non-linear activation functions were implemented and evaluated against accelerated models produced by the open-source tool hls4ml, a Python package for latency-optimized machine learning inference on FPGAs. Experimental results show that the proposed architecture outperforms the corresponding hls4ml models, with speedups of up to three times.
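The abstract's core technique is segmenting a non-linear activation function into linear pieces so each evaluation reduces to one multiply and one add. The sketch below is a minimal, hypothetical illustration of that general idea for a sigmoid on a fixed input range, not the paper's exact scheme (the authors' segmentation, fixed-point formats, and HLS pragmas are not reproduced here):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative only: approximate sigmoid on [X_MIN, X_MAX] with N_SEG
// uniform linear segments. Slopes and intercepts are precomputed, so
// evaluation is a table lookup plus y = a*x + b -- the kind of operation
// that maps well to a single FPGA DSP slice.
constexpr std::size_t N_SEG = 64;
constexpr double X_MIN = -8.0, X_MAX = 8.0;

struct Segment { double slope, intercept; };

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Build the segment table (in hardware this would happen offline,
// before synthesis; here it runs at startup for demonstration).
void build_table(Segment table[N_SEG]) {
    const double step = (X_MAX - X_MIN) / N_SEG;
    for (std::size_t i = 0; i < N_SEG; ++i) {
        const double x0 = X_MIN + i * step, x1 = x0 + step;
        const double y0 = sigmoid(x0),      y1 = sigmoid(x1);
        table[i].slope     = (y1 - y0) / step;      // chord slope
        table[i].intercept = y0 - table[i].slope * x0;
    }
}

// Evaluate: saturate outside the table range, otherwise index the
// segment containing x and apply its linear model.
double sigmoid_pwl(const Segment table[N_SEG], double x) {
    if (x <= X_MIN) return sigmoid(X_MIN);
    if (x >= X_MAX) return sigmoid(X_MAX);
    const double step = (X_MAX - X_MIN) / N_SEG;
    std::size_t i = static_cast<std::size_t>((x - X_MIN) / step);
    if (i >= N_SEG) i = N_SEG - 1;                  // guard rounding
    return table[i].slope * x + table[i].intercept;
}
```

With 64 uniform segments over [-8, 8] the chord approximation stays within roughly 1e-3 of the true sigmoid, which suggests why the abstract can claim "highly accurate" approximations while keeping the per-neuron datapath trivially simple and parallelizable.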