Parallel CNNs, Pooling & Image Layers for Quartus Backend
Created by: bo3z
Description
📝 Convolutional, Pooling & Image Layers for Quartus Backend
- 9219a0ea Adds support for Conv1D & Conv2D Layers using im2col, in a similar manner to Vivado.
- For `fs = 3` and `fs = 1`, an optimized convolution is implemented using Winograd's minimal filtering algorithm (f7365baf) and pointwise im2col (29bd0267), respectively.
- Introduces the idea of a parallelisation factor, allowing for a fully unrolled convolution executing in 8 clock cycles.
- aa662478 Support for Max & Avg Pooling as well as Global Pooling layers
- 8cb9f21e Support for Zero Padding, Transpose and Upsampling layers
- 3952854c Corresponding PyTests and HLS resource/latency analysis for all of the above layers.
- aa662478 Adds support for Vivado 2D Global Pooling (new feature on Vivado)
Type of change
- New feature (non-breaking change which adds functionality)

It is recommended to review this PR commit by commit (rather than as a side-by-side diff), as each commit adds a specific feature and this is a fairly extensive PR. Each commit is self-contained and can be checked out and the project compiled.
Implementation details
- As a base, convolutional layers are implemented in a similar way to Vivado, using the im2col algorithm. im2col transforms the input matrix into a larger patch matrix suitable for matrix multiplication with the kernel. This way, the computationally more complex convolution is replaced with a dense matrix multiplication (a small sketch of this idea is given after this list). Loops traversing the filters and channels are fully unrolled, as the total number of filters in `io_parallel` is usually low (fewer than 16), allowing for a constant latency with respect to the number of filters. The loops traversing the rows and columns of the input image are pipelined with an initiation interval determined by the reuse factor. A larger reuse factor will reduce resource usage and allow for larger architectures, at the expense of latency.
- An optimized convolution for 3x3 kernels is implemented using the Winograd minimal filtering algorithm. For a more detailed description of Winograd's minimal algorithm, see:
- Lavin & Gray (2015). Fast Algorithms for Convolutional Neural Networks
- Xygkis et al. (2018). Efficient Winograd-based Convolution Kernel Implementation on Edge Devices
Winograd's minimal filtering algorithm relies on a series of input and kernel matrix transformations, replacing convolution with an elementwise product (a numerical sketch of the transform is given after this list). Kernels can be transformed offline, prior to FPGA inference. The input matrix transformation can be explicitly written out - when done in such a way, the output matrix is obtained through additions and subtractions of elements of the original matrix. This way, the transformation is implemented through combinational logic, reducing latency. Winograd's algorithm offers the lowest computational complexity for convolution by considering multiple output pixels at once. This way, the stride over the input image is larger than one. For example, for 3x3 kernels, the loops iterating over the height (H) and width (W) of the input image are invoked H/2 and W/2 times, respectively, compared to im2col, which invokes the loops H and W times. Each loop iteration has a higher cost, but the instructions within a loop can usually be executed through combinational logic and register reads/writes. Winograd's algorithm has several disadvantages, including:
- Cannot be used for stride != 1 without significant modifications and latency penalties.
- Different implementations (read, matrix transformations) are needed for different kernel sizes. This PR implements Winograd's algorithm for the most commonly used kernel sizes, 3x3 and 3x1.
- Numerically unstable - Winograd's algorithm is built on top of Lagrange interpolation, which is known for its poor numerical properties. Therefore, as the kernel size increases, Winograd's algorithm incurs a non-negligible error. This error is also noticeable with aggressive quantization, whether through QAT or PTQ. Both problems have been researched, but addressing them would come at a latency cost, due to additional transformations and more complex transformation matrices. For 3x3 kernels, this error is negligible and the algorithm can be used without any loss in accuracy.
- Pointwise im2col - similar to PR #471, an optimised im2col implementation for 1x1 kernels is added (a small sketch is given after this list).
- This PR introduces the idea of a parallelisation factor, in addition to the previously used reuse factor. The reuse factor controls the initiation interval of the loops traversing the input image: a large reuse factor will increase the initiation interval and latency, and reduce resource usage. The parallelisation factor, on the other hand, determines the unroll factor of the loops traversing the input image: a larger parallelisation factor will create multiple copies of the loop, lowering latency. The outer loop (input height) is only unrolled if the inner loop (input width) is fully unrolled. Using this approach, it is possible to compute a full convolutional layer in 8 clock cycles (at a large resource utilisation). Both values should therefore be tweaked together when designing an architecture: for larger inputs and models, the reuse factor should be increased to fit the available device resources, while keeping the parallelisation factor at one; for individual layers with a small input (deeper in the network), the parallelisation factor can be increased, allowing faster inference. Below are some results with respect to changing the reuse and parallelisation factors. Both variables are available for both im2col and Winograd, and are set through the standard hls4ml configuration (a sketch is given after this list).
- Support for Average and Max 1D & 2D Pooling layers, as well as Global Pooling. Through experiments (see results below), it was observed that the optimal implementation is a fully unrolled one - it minimises both resource usage and latency (the underlying reason for this is not entirely clear). Finally, support for Vivado 2D Global Pooling was added, for completeness' sake.
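To make the im2col idea above concrete, here is a minimal NumPy sketch (stride 1, no padding; the function and variable names are illustrative, not taken from the HLS code) showing how convolution reduces to a single dense matrix multiplication:

```python
import numpy as np

def im2col(x, fh, fw):
    """Gather every fh x fw receptive field of an (H, W, C) image into the
    rows of a patch matrix, so convolution becomes patches @ flattened_kernel.
    Stride 1 and no padding are assumed, for brevity."""
    h, w, c = x.shape
    oh, ow = h - fh + 1, w - fw + 1
    patches = np.empty((oh * ow, fh * fw * c))
    for i in range(oh):          # in the HLS version, these two loops are
        for j in range(ow):      # pipelined with an II set by the reuse factor
            patches[i * ow + j] = x[i:i + fh, j:j + fw, :].ravel()
    return patches

x = np.random.rand(8, 8, 3)              # input image (H, W, C)
k = np.random.rand(3, 3, 3, 16)          # 3x3 kernel, 3 channels, 16 filters
y = im2col(x, 3, 3) @ k.reshape(-1, 16)  # (36, 16): one row per output pixel
```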
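Likewise, a small numerical sketch of the F(2x2, 3x3) Winograd transform, using the standard matrices from Lavin & Gray cited above. The per-tile structure mirrors the description: the kernel transform can be computed offline, and the input transform uses only additions and subtractions:

```python
import numpy as np

# Standard F(2x2, 3x3) transformation matrices (Lavin & Gray, 2015):
# Y = A^T [ (G g G^T) * (B^T d B) ] A, with '*' the elementwise product.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    """One 4x4 input tile d and one 3x3 kernel g -> one 2x2 output tile."""
    U = G @ g @ G.T      # kernel transform: done offline, before inference
    V = B_T @ d @ B_T.T  # input transform: additions/subtractions only
    return A_T @ (U * V) @ A_T.T

# Sanity check against a direct 'valid' convolution over the same tile:
d, g = np.random.rand(4, 4), np.random.rand(3, 3)
direct = np.array([[(d[i:i + 3, j:j + 3] * g).sum() for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_tile(d, g), direct)
```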
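For the pointwise case, the patch matrix degenerates to a simple reshape, which is why a specialised implementation pays off. A sketch, in the same illustrative NumPy terms:

```python
import numpy as np

# 1x1 kernels: each receptive field is a single pixel, so the im2col patch
# matrix is just the (H, W, C) input reshaped to (H*W, C) - no duplication.
x = np.random.rand(8, 8, 3)
k_pw = np.random.rand(1, 1, 3, 16)
y_pw = x.reshape(-1, 3) @ k_pw.reshape(3, 16)  # (64, 16): one row per pixel
```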
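Finally, a sketch of how the two factors are set from Python. The toy model and the layer name `conv1` are purely illustrative, and the `ParallelizationFactor` key is assumed to follow the naming already used for the Vivado backend:

```python
import hls4ml
from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model

# Toy single-layer model, purely for illustration.
inp = Input((28, 28, 3))
out = Conv2D(4, (3, 3), name='conv1')(inp)
model = Model(inp, out)

config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['LayerName']['conv1']['ReuseFactor'] = 1            # II of the image loops
config['LayerName']['conv1']['ParallelizationFactor'] = 4  # unroll copies of the loop

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Quartus', io_type='io_parallel'
)
```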
Latency and resource usage
As expected from theory, Winograd has lower latency as well as lower resource usage when compared to im2col. All tests targeted an Agilex F14, with 10 data points and full Quartus synthesis. Results for different RF & PF will be added once the scan is complete.
As stated above, a fully unrolled pooling layer is optimal. While a pooling layer has no notion of a reuse factor, increasing the overall reuse factor should help reduce resource usage and fit the desired architecture, as the reuse factor also dictates the component initiation interval.
Tests
- test_cnn_mnist.py - a new unit test, testing the accuracy of a Keras CNN and its hls4ml counterpart in classifying MNIST digits
- test_cnn_mnist_qkeras.py - renamed from test_cnn_mnist.py and included Quartus backend
- test_conv1d.py - included Quartus backend
- test_global_pooling.py - renamed from test_global_pooling1d.py, included 2D Global Pooling and Quartus as an additional backend, in addition to Vivado.
- test_keras_api.py - adds a basic threshold check, verifying that the output of a Conv/Pooling layer from hls4ml is approximately equal to the Keras output
- test_pointwiseconv.py - included Quartus backend
- test_upsampling.py - included Quartus backend
- test_zeropading.py - included Quartus backend
Checklist
- I have read the guidelines for contributing.
- I have commented my code, particularly in hard-to-understand areas.
- I have made corresponding changes to the documentation.
- My changes generate no new warnings.
- I have added tests that prove my fix is effective or that my feature works.