
Quartus GRU

Javier Duarte requested to merge github/fork/bo3z/quartus-gru into main

Created by: bo3z

Description

📝 Gated Recurrent Units (GRUs) for Quartus backend

  • Adds support for GRU layers in the Quartus backend
  • Similar to work done on the Vivado backend (#576) and Quartus LSTM & SimpleRNN (#575)

Type of change

  • New feature (non-breaking change which adds functionality)

Tests

  • Accuracy tests through PyTest, for more details see test/pytest/test_rnn.py
  • IP simulation using cosim.
  • Successful synthesis and analysis of device resources and latency (see below)

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.

Implementation

  • The HLS code for GRU layers consists of two main functions:
  1. gru_cell(t, h, weights, recurrent_weights, bias, recurrent_bias) - takes the input vector at time step t, together with the current hidden state, h, and produces the new hidden state according to the GRU logic (reset gate, update gate and candidate state). This function contains several loops over the number of GRU units/states; those loops are unrolled according to the configured reuse factor. For results on resource usage and latency, see below.
  2. gru(data, res, weights, recurrent_weights, bias, recurrent_bias) - makes use of the previous function by traversing the data one time step at a time and updating the hidden state, until the final output is obtained. Note that this loop cannot be pipelined, because of a loop-carried dependency: at every iteration, the previous hidden state must be available before the new state can be calculated.
  • The backend contains a layer initialiser and the appropriate templates. Matrix multiplication and bias addition are performed through the Dense layer. Finally, a resource-strategy optimizer handles the matrix transposes needed for the Dense multiplication, rather than performing them in the layer initialisation procedures. A simplified structural sketch of the two HLS functions is given after this list.
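To make the control structure concrete, below is a minimal, plain-C++ sketch of the two functions. It is not the hls4ml Quartus source: fixed-point types, HLS pragmas and the separate recurrent bias are omitted, and the CONFIG_T fields (n_in, n_units, n_timesteps), the flat weight layout, the Keras gate ordering (update, reset, candidate) and the reset-before-matmul candidate formulation are assumptions made purely for illustration.

```cpp
#include <cmath>

// Structural sketch only -- simplified from the description above.
// CONFIG_T is assumed to provide static constexpr int n_in, n_units, n_timesteps.

template <typename CONFIG_T>
void gru_cell(const float x[CONFIG_T::n_in],   // input vector at time step t
              float h[CONFIG_T::n_units],      // hidden state, updated in place
              const float W[],                 // kernel, layout [gate][unit][input]
              const float U[],                 // recurrent kernel, layout [gate][unit][unit]
              const float b[]) {               // bias, layout [gate][unit]
    const int NU = CONFIG_T::n_units, NI = CONFIG_T::n_in;
    float z[NU], r[NU], c[NU];

    // Update (z) and reset (r) gates, computed from x and the previous state h.
    // In the HLS version, the loops over the units are unrolled according to
    // the reuse factor; here they are plain loops.
    for (int u = 0; u < NU; u++) {
        float z_acc = b[u], r_acc = b[NU + u];
        for (int i = 0; i < NI; i++) {
            z_acc += W[u * NI + i] * x[i];
            r_acc += W[(NU + u) * NI + i] * x[i];
        }
        for (int j = 0; j < NU; j++) {
            z_acc += U[u * NU + j] * h[j];
            r_acc += U[(NU + u) * NU + j] * h[j];
        }
        z[u] = 1.0f / (1.0f + std::exp(-z_acc));
        r[u] = 1.0f / (1.0f + std::exp(-r_acc));
    }

    // Candidate state, with the reset gate applied to the previous hidden state.
    for (int u = 0; u < NU; u++) {
        float c_acc = b[2 * NU + u];
        for (int i = 0; i < NI; i++)
            c_acc += W[(2 * NU + u) * NI + i] * x[i];
        for (int j = 0; j < NU; j++)
            c_acc += U[(2 * NU + u) * NU + j] * (r[j] * h[j]);
        c[u] = std::tanh(c_acc);
    }

    // New hidden state (Keras convention: h = z * h_prev + (1 - z) * candidate).
    for (int u = 0; u < NU; u++)
        h[u] = z[u] * h[u] + (1.0f - z[u]) * c[u];
}

template <typename CONFIG_T>
void gru(const float data[CONFIG_T::n_timesteps * CONFIG_T::n_in],
         float res[CONFIG_T::n_units],
         const float W[], const float U[], const float b[]) {
    float h[CONFIG_T::n_units];
    for (int u = 0; u < CONFIG_T::n_units; u++)
        h[u] = 0.0f;                              // zero initial state

    // This loop cannot be pipelined in HLS: iteration t + 1 needs the hidden
    // state produced by iteration t (a loop-carried dependency), so the total
    // latency grows linearly with the number of time steps.
    for (int t = 0; t < CONFIG_T::n_timesteps; t++)
        gru_cell<CONFIG_T>(&data[t * CONFIG_T::n_in], h, W, U, b);

    for (int u = 0; u < CONFIG_T::n_units; u++)
        res[u] = h[u];                            // final hidden state is the output
}
```

Because the per-unit loops are unrolled, resources scale with the number of units, while the latency is set by the sequential time loop; this is consistent with the plots in the Results section.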

Results

Below are latency, DSP, REG and ALM usage results of a GRU layer with a 5-dimensional input, 8 time steps and a variable number of units.

As expected, the latency remains approximately constant when increasing the number of units, while DSP, REG and ALM usage increases at a linear rate. This occurs because the implementation unrolls several loops over the number of units/states; such an implementation is therefore time-invariant, but resource-inefficient. (Plots: latency, DSP, REG and ALM usage vs. number of units.)

Finally, with the number of units fixed to 8 and the input size to 5, similar plots are obtained. As the time loop has pipelining disabled (due to the loop-carried dependency), the use of DSPs remains approximately constant. ALMs and REGs increase slightly, because a larger input needs to be stored. The latency increases at a linear rate, as expected. (Plots: latency, DSP, REG and ALM usage vs. number of time steps.)
