
Quartus GRU

Javier Duarte requested to merge github/fork/bo3z/quartus-gru into main

Created by: bo3z

Description

📝 Gated Recurrent Units (GRUs) for Quartus backend

  • Adds support for GRU layers in the Quartus backend
  • Similar to work done on the Vivado backend (#576) and Quartus LSTM & SimpleRNN (#575)

Type of change

  • New feature (non-breaking change which adds functionality)

Tests

  • Accuracy tests through PyTest, for more details see test/pytest/test_rnn.py
  • IP simulation using cosim.
  • Successful synthesis and analysis of device resources and latency (see below)

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.

Implementation

  • The HLS code for GRU layers consists of two main functions:
  1. gru_cell(t, h, weights, recurrent_weights, bias, recurrent_bias) - takes the input vector at time step t, together with the current hidden state, h, and produces the new hidden state according to the GRU logic (reset gate, update gate and candidate state). This function contains several loops over the number of GRU units/states; those loops are unrolled according to the configured reuse factor. For results on resource usage and latency, see below.
  2. gru(data, res, weights, recurrent_weights, bias, recurrent_bias) - makes use of the previous function by traversing the data one time step at a time and updating the hidden state, until the final output is obtained. Note that this loop cannot be pipelined, because of a loop-carried dependency: at every iteration, the previous hidden state must be available before the new state can be calculated.
  • The backend contains a layer initialiser and the appropriate templates. Matrix multiplication and bias addition are performed through the Dense layer. Finally, a resource-strategy optimizer handles the matrix transposes needed for the Dense multiplication, rather than performing them in the layer initialisation procedures. A simplified structural sketch of the two HLS functions is given after this list.
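To make the control structure concrete, below is a minimal, plain-C++ sketch of the two functions. It is not the hls4ml Quartus source: fixed-point types, HLS pragmas and the separate recurrent bias are omitted, and the CONFIG_T fields (n_in, n_units, n_timesteps), the flat weight layout, the Keras gate ordering (update, reset, candidate) and the reset-before-matmul candidate formulation are assumptions made purely for illustration.

```cpp
#include <cmath>

// Structural sketch only -- simplified from the description above.
// CONFIG_T is assumed to provide static constexpr int n_in, n_units, n_timesteps.

template <typename CONFIG_T>
void gru_cell(const float x[CONFIG_T::n_in],   // input vector at time step t
              float h[CONFIG_T::n_units],      // hidden state, updated in place
              const float W[],                 // kernel, layout [gate][unit][input]
              const float U[],                 // recurrent kernel, layout [gate][unit][unit]
              const float b[]) {               // bias, layout [gate][unit]
    const int NU = CONFIG_T::n_units, NI = CONFIG_T::n_in;
    float z[NU], r[NU], c[NU];

    // Update (z) and reset (r) gates, computed from x and the previous state h.
    // In the HLS version, the loops over the units are unrolled according to
    // the reuse factor; here they are plain loops.
    for (int u = 0; u < NU; u++) {
        float z_acc = b[u], r_acc = b[NU + u];
        for (int i = 0; i < NI; i++) {
            z_acc += W[u * NI + i] * x[i];
            r_acc += W[(NU + u) * NI + i] * x[i];
        }
        for (int j = 0; j < NU; j++) {
            z_acc += U[u * NU + j] * h[j];
            r_acc += U[(NU + u) * NU + j] * h[j];
        }
        z[u] = 1.0f / (1.0f + std::exp(-z_acc));
        r[u] = 1.0f / (1.0f + std::exp(-r_acc));
    }

    // Candidate state, with the reset gate applied to the previous hidden state.
    for (int u = 0; u < NU; u++) {
        float c_acc = b[2 * NU + u];
        for (int i = 0; i < NI; i++)
            c_acc += W[(2 * NU + u) * NI + i] * x[i];
        for (int j = 0; j < NU; j++)
            c_acc += U[(2 * NU + u) * NU + j] * (r[j] * h[j]);
        c[u] = std::tanh(c_acc);
    }

    // New hidden state (Keras convention: h = z * h_prev + (1 - z) * candidate).
    for (int u = 0; u < NU; u++)
        h[u] = z[u] * h[u] + (1.0f - z[u]) * c[u];
}

template <typename CONFIG_T>
void gru(const float data[CONFIG_T::n_timesteps * CONFIG_T::n_in],
         float res[CONFIG_T::n_units],
         const float W[], const float U[], const float b[]) {
    float h[CONFIG_T::n_units];
    for (int u = 0; u < CONFIG_T::n_units; u++)
        h[u] = 0.0f;                              // zero initial state

    // This loop cannot be pipelined in HLS: iteration t + 1 needs the hidden
    // state produced by iteration t (a loop-carried dependency), so the total
    // latency grows linearly with the number of time steps.
    for (int t = 0; t < CONFIG_T::n_timesteps; t++)
        gru_cell<CONFIG_T>(&data[t * CONFIG_T::n_in], h, W, U, b);

    for (int u = 0; u < CONFIG_T::n_units; u++)
        res[u] = h[u];                            // final hidden state is the output
}
```

Because the per-unit loops are unrolled, resources scale with the number of units, while the latency is set by the sequential time loop; this is consistent with the plots in the Results section.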

Results

Below are latency, DSP, REG and ALM usage results of a GRU layer with a 5-dimensional input, 8 time steps and a variable number of units.

As expected, the latency remains approximately constant when increasing the number of units, while DSP, REG and ALM usage increases at a linear rate. This occurs because the implementation unrolls several loops over the number of units/states; such an implementation is therefore time-invariant, but resource-inefficient. (Plots: latency, DSP, REG and ALM usage vs. number of units.)

Finally, with the number of units fixed to 8 and the input size to 5, similar plots are obtained. As the time loop has pipelining disabled (due to the loop-carried dependency), the use of DSPs remains approximately constant. ALMs and REGs increase slightly, because a larger input needs to be stored. The latency increases at a linear rate, as expected. (Plots: latency, DSP, REG and ALM usage vs. number of time steps.)
