Quartus GRU
Created by: bo3z
Description
📝 Gated Recurrent Units (GRUs) for Quartus backend
- Adds support for GRU Units in Quartus backend
- Similar to work done on the Vivado backend (#576) and Quartus LSTM & SimpleRNN (#575)
Type of change
- New feature (non-breaking change which adds functionality)
Tests
- Accuracy tests through PyTest; for more details, see `test/pytest/test_rnn.py`
- IP simulation using cosim.
- Successful synthesis and analysis of device resources and latency (see below).
Checklist
- I have read the guidelines for contributing.
- I have commented my code, particularly in hard-to-understand areas.
- I have made corresponding changes to the documentation.
- My changes generate no new warnings.
- I have added tests that prove my fix is effective or that my feature works.
Implementation
- The HLS code for GRU layers consists of two main functions:
  - `gru_cell(x, h, weights, recurrent_weights, bias, recurent_bias)` - takes the input vector, x, and the hidden state, h, at time step t, and produces the new hidden state according to the GRU logic (reset, update and candidate-state gates). This function has several loops over the number of GRU units/states; those loops are unrolled with the appropriate reuse factor. For results on resource usage and latency, see below.
  - `gru(data, res, weights, recurrent_weights, bias, recurent_bias)` - makes use of the previous function by traversing the data at each time step and obtaining the new state, until the final output is obtained. Note that it is not possible to pipeline this function, because of a loop-carried dependency: at every iteration, the current state must be available before the new state can be calculated. A minimal sketch of both functions is given after this list.
- The backend contains a layer initialiser and the appropriate templates. Matrix multiplication and bias addition are done through the `Dense` layer. Finally, a resource-strategy optimizer handles the matrix transposes needed for `Dense` multiplication, rather than performing them in the layer-initialisation procedures.
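To make the structure described above concrete, here is a minimal, float-based sketch of the two functions. It is illustrative only and not the hls4ml source: the real code is templated on fixed-point types and a configuration struct, the array sizes (`N_IN`, `N_UNITS`, `N_TIMESTEPS`), the gate ordering and the reset-after gate convention are assumptions made for readability, and in practice the matrix-vector products go through the `Dense` code mentioned above rather than hand-written loops.

```cpp
#include <cmath>

// Assumed sizes, matching the configuration used in the Results section.
constexpr int N_IN = 5;     // input feature dimension
constexpr int N_UNITS = 8;  // number of GRU units/states

inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One GRU step: consumes the input vector x at the current time step and the
// previous hidden state h, and overwrites h with the new hidden state.
// Weight rows are laid out as [update | reset | candidate] blocks of N_UNITS.
void gru_cell(const float x[N_IN], float h[N_UNITS],
              const float weights[3 * N_UNITS][N_IN],
              const float recurrent_weights[3 * N_UNITS][N_UNITS],
              const float bias[3 * N_UNITS],
              const float recurrent_bias[3 * N_UNITS]) {
    float z[N_UNITS], r[N_UNITS], hh[N_UNITS];

    // Update (z) and reset (r) gates. In the HLS code, loops over the number
    // of units are unrolled according to the reuse factor; a plain
    // #pragma unroll stands in for that here.
    #pragma unroll
    for (int u = 0; u < N_UNITS; u++) {
        float acc_z = bias[u] + recurrent_bias[u];
        float acc_r = bias[N_UNITS + u] + recurrent_bias[N_UNITS + u];
        for (int i = 0; i < N_IN; i++) {
            acc_z += weights[u][i] * x[i];
            acc_r += weights[N_UNITS + u][i] * x[i];
        }
        for (int j = 0; j < N_UNITS; j++) {
            acc_z += recurrent_weights[u][j] * h[j];
            acc_r += recurrent_weights[N_UNITS + u][j] * h[j];
        }
        z[u] = sigmoid(acc_z);
        r[u] = sigmoid(acc_r);
    }

    // Candidate state: the reset gate scales the recurrent contribution.
    #pragma unroll
    for (int u = 0; u < N_UNITS; u++) {
        float acc_x = bias[2 * N_UNITS + u];
        for (int i = 0; i < N_IN; i++)
            acc_x += weights[2 * N_UNITS + u][i] * x[i];
        float acc_h = recurrent_bias[2 * N_UNITS + u];
        for (int j = 0; j < N_UNITS; j++)
            acc_h += recurrent_weights[2 * N_UNITS + u][j] * h[j];
        hh[u] = std::tanh(acc_x + r[u] * acc_h);
    }

    // Blend the previous state and the candidate state into the new state.
    #pragma unroll
    for (int u = 0; u < N_UNITS; u++)
        h[u] = z[u] * h[u] + (1.0f - z[u]) * hh[u];
}
```

The top-level `gru` function then walks the time steps, feeding the hidden state from one iteration into the next, which is exactly the loop-carried dependency that prevents pipelining. The pragma shown below is the Intel HLS one for disabling loop pipelining; the exact pragma usage in the hls4ml code may differ.

```cpp
// Top-level GRU: traverses the input sequence one time step at a time,
// reusing gru_cell and the constants from the sketch above. N_TIMESTEPS
// matches the 8 time steps used in the Results section.
constexpr int N_TIMESTEPS = 8;

void gru(const float data[N_TIMESTEPS][N_IN], float res[N_UNITS],
         const float weights[3 * N_UNITS][N_IN],
         const float recurrent_weights[3 * N_UNITS][N_UNITS],
         const float bias[3 * N_UNITS],
         const float recurrent_bias[3 * N_UNITS]) {
    float h[N_UNITS] = {0.0f};  // hidden state, initialised to zero

    // Each iteration consumes the hidden state produced by the previous one,
    // so the loop carries a dependency and cannot be pipelined.
    #pragma disable_loop_pipelining
    for (int t = 0; t < N_TIMESTEPS; t++)
        gru_cell(data[t], h, weights, recurrent_weights, bias, recurrent_bias);

    // Only the final hidden state is returned in this simplified sketch.
    for (int u = 0; u < N_UNITS; u++)
        res[u] = h[u];
}
```

Because each iteration reads the state written by the previous one, the time loop's latency grows with the number of time steps while the per-step hardware is reused, which is consistent with the resource and latency behaviour reported below.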
Results
Below are latency, DSP, REG and ALM usage results of a GRU layer with a 5-dimensional input, 8 time steps and a variable number of units.
As expected, the latency remains approximately constant as the number of units increases, while DSP, REG and ALM usage grows at a linear rate. This occurs because the implementation unrolls several loops over the number of units/states; such an implementation is therefore time-invariant but resource-inefficient.
Finally, with the number of units fixed to 8 and the input size to 5, similar plots are obtained as the number of time steps is varied. Because the time loop has pipelining disabled (due to the loop-carried dependency), DSP usage remains approximately constant. ALM and REG usage increases slightly, because a larger input needs to be stored, and the latency increases at a linear rate, as expected.