Latency-optimized hardware acceleration of multilayer perceptron inference
Publication Type
Conference Paper
Date Issued
2023
Language
English
Start Page
235
End Page
241
Citation
26th Euromicro Conference on Digital System Design: 235-241 (2023)
Contribution to Conference
26th Euromicro Conference on Digital System Design 2023
Publisher
IEEE
ISBN
979-835034419-6
Abstract
Reducing the inference latency of neural networks is crucial wherever real-time responses are required. We propose a new neuron architecture for parallel computation, targeting MLP implementations on FPGAs. The parallelism in the proposed architecture is exposed by segmenting the non-linear activation functions into sets of linear segments that approximate the original functions with high accuracy. The implementation combines this with further optimization techniques such as fixed-point arithmetic, pipelining, array partitioning, and loop unrolling. To validate the proposed architecture with the Xilinx Vitis HLS toolchain, four MLPs with a mix of non-linear activation functions were implemented and compared against accelerated models produced by hls4ml, an open-source Python package for latency-optimized machine-learning inference on FPGAs. Experimental results show that the proposed architecture outperforms the corresponding hls4ml models, with speedups of up to three times.
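The core idea, replacing a non-linear activation with a small table of linear segments so that each evaluation reduces to one multiply and one add, can be illustrated with a short sketch. The following is a minimal, software-only C++ illustration, not the paper's HLS implementation: the tanh example, the 16-segment table, and the Q4.12 fixed-point format are assumptions made for the sketch. In an actual Vitis HLS design the segment table would live in on-chip ROM, and pragmas such as ARRAY_PARTITION and UNROLL would expose the parallelism described in the abstract.

```cpp
#include <cstdint>
#include <cmath>
#include <cstdio>

// Q4.12 fixed point: 12 fractional bits (illustrative choice, not the paper's).
constexpr int     FRAC = 12;
constexpr int32_t ONE  = 1 << FRAC;

// 16 linear segments over [-4, 4]; tanh saturates outside this range.
constexpr int   NSEG = 16;
constexpr float XMIN = -4.0f, XMAX = 4.0f;

int32_t slope[NSEG], intercept[NSEG];   // per-segment a and b, fixed point

int32_t to_fix(float v) { return (int32_t)lroundf(v * ONE); }
float   to_flt(int32_t v) { return (float)v / ONE; }

// Build the table offline: on each segment, y ~= a*x + b interpolates tanh
// between the segment endpoints (chord approximation).
void build_table() {
    const float step = (XMAX - XMIN) / NSEG;   // 0.5 per segment
    for (int i = 0; i < NSEG; ++i) {
        float x0 = XMIN + i * step, x1 = x0 + step;
        float a = (tanhf(x1) - tanhf(x0)) / step;
        float b = tanhf(x0) - a * x0;
        slope[i]     = to_fix(a);
        intercept[i] = to_fix(b);
    }
}

// Piecewise-linear tanh: clamp, pick a segment, one multiply, one add.
int32_t pwl_tanh(int32_t x) {
    if (x <= to_fix(XMIN)) return to_fix(-1.0f);
    if (x >= to_fix(XMAX)) return to_fix(1.0f);
    // Segment width is 0.5 = 1 << (FRAC - 1), a power of two, so the
    // segment index is just a shift -- cheap combinational logic on an FPGA.
    int idx = (x - to_fix(XMIN)) >> (FRAC - 1);
    int32_t y = (int32_t)(((int64_t)slope[idx] * x) >> FRAC);  // Q4.12 * Q4.12 -> Q4.12
    return y + intercept[idx];
}

int main() {
    build_table();
    for (float x = -5.0f; x <= 5.0f; x += 1.25f)
        printf("tanh(%+.2f) ~ %+.4f (exact %+.4f)\n",
               x, to_flt(pwl_tanh(to_fix(x))), tanhf(x));
    return 0;
}
```

With 16 segments the chord approximation of tanh stays within a few hundredths of the exact value; increasing the segment count or the fixed-point precision tightens the error at the cost of a larger table, which is the accuracy/resource trade-off such designs navigate.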
Subjects
FPGA
MLP
Non-linear activation function
Parallel
Segmentation
MLE@TUHH
DDC Class
004: Computer Sciences
620: Engineering