Streaming implementations of Conv2D and friends
Created by: vloncar
This PR introduces stream-based implementations of most of the layers. The main idea behind the CNN layers is that, instead of bookkeeping a position within the kernel window as pixels arrive sequentially over the stream, we encode the instructions for all pixels of the input into an array. To keep the array size in check (so it is not as large as the input image itself), we encode the instructions for the minimal input that exhibits the same access pattern. For example, every convolution with a 3x3 kernel and stride=1, regardless of input size, can be represented with 25 unique instructions, corresponding to the minimal input size of 5x5. See the following animation for a demonstration (the input is at the top, the output at the bottom; the top left corner shows the "pattern" for each pixel, while on the right is the internal state during processing).
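The pattern-reuse idea can be sketched in plain Python. This is an illustration of the concept, not the actual HLS code; `pixel_patterns` is a made-up helper that maps each input pixel to the set of kernel positions it feeds for a non-padded 3x3, stride-1 convolution. Pixels with identical sets would share one "instruction".

```python
def pixel_patterns(height, width, k=3):
    """Map each pixel (r, c) to the frozenset of kernel offsets (kr, kc)
    it contributes to, over all output windows of a k x k, stride-1,
    non-padded convolution. Illustrative sketch, not the hls4ml code."""
    patterns = {}
    for r in range(height):
        for c in range(width):
            offsets = set()
            for i in range(height - k + 1):      # window top row
                for j in range(width - k + 1):   # window left column
                    if i <= r < i + k and j <= c < j + k:
                        offsets.add((r - i, c - j))
            patterns[(r, c)] = frozenset(offsets)
    return patterns

# A 5x5 input already exhibits every access pattern a 3x3/stride-1
# convolution can produce: 25 distinct instructions, one per pixel.
small = set(pixel_patterns(5, 5).values())
large = set(pixel_patterns(12, 12).values())
assert len(small) == 25
assert small == large  # no new patterns appear for bigger inputs
```

Interior pixels of any larger image all share the full 3x3 pattern, and the boundary pixels fall into the same classes as the 5x5 border, which is why the instruction array does not need to grow with the input.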
To use it, just switch `IOType` to `io_stream`. This can be done through the API (`convert_from_keras_model`) or in the yml configuration.
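For the yml route, a minimal sketch might look like this (file names and the FPGA part are placeholders; only the `IOType` line is the relevant change):

```yaml
# Hypothetical hls4ml yml configuration excerpt
KerasJson: model.json
KerasH5: model_weights.h5
OutputDir: my-hls-test
ProjectName: myproject
IOType: io_stream   # was io_parallel
```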
Implemented layers:
- `Conv1D`, `Conv2D`
  - `same` padding is implemented by inserting a `ZeroPadding2D` layer before `Conv2D` (same for 1D)
  - 1x1 `Conv2D` is handled in a special way, same for pointwise `Conv1D`
- `SeparableConv1D`, `SeparableConv2D`, `DepthwiseConv2D`
- `MaxPooling1D`, `MaxPooling2D`, `AveragePooling1D`, `AveragePooling2D`
- `GlobalMaxPooling2D`, `GlobalAveragePooling1D`
- `Dense`, including QKeras variants
- `BatchNormalization`
- `ReLU` and `Softmax` activations
- `UpSampling2D`, with only `nearest` interpolation supported
- `Add`, `Subtract`, `Multiply`, `Average`, `Maximum` and `Minimum`

`Conv1D`/`Conv2D` and `Dense` leverage the existing `dense` HLS implementation, so `ReuseFactor` and `Strategy` work.
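As a sketch, assuming the usual hls4ml yml layout, those knobs carry over unchanged (precision and values below are placeholders):

```yaml
# Hypothetical excerpt: the existing HLSConfig knobs apply with io_stream
IOType: io_stream
HLSConfig:
  Model:
    Precision: ap_fixed<16,6>
    ReuseFactor: 4
    Strategy: Resource   # or Latency
```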
Also included are dozens of small improvements to the internal model and the overall codebase.
Current limitations:
- Only square kernels are supported (e.g., 2x2, 3x3, ...)
- Maximum stride is the width of the kernel (i.e., you can't skip over some pixels) except for 1x1 (pointwise) convolution
- No grouped or dilated convolutions
An earlier implementation introduced `PackingFactor` to cram more pixels into a single element of the stream. While this helped process the input faster, it caused many issues when input and output packing were mismatched and increased the LUT/FF usage significantly, so it has been abandoned (for now). We are considering tiling as an alternative.
This is feature-complete and has been extensively tested by the CERN team, so I consider it fairly stable and ready for review. Some very minor changes and bugfixes may still come.
Questions, comments, praises and criticisms are welcome!