Streaming implementations of Conv2D and friends

Javier Duarte requested to merge github/fork/vloncar/cnn_stream into master

Created by: vloncar

This PR introduces stream-based implementations of most of the layers. The main idea behind the CNN layers is that instead of bookkeeping a position within the kernel window as pixels arrive sequentially from the stream, we encode the instructions for all pixels of the input into an array. To keep the array size in check (so it is not as large as the input image itself), we encode the instructions for the minimal input that exhibits the same access pattern. For example, every convolution with a 3x3 kernel and stride=1, regardless of input size, can be represented with 25 unique instructions, corresponding to the minimal input size of 5x5. See the following animation for a demonstration: conv2d_animation (the input is at the top, the output at the bottom; the top-left corner shows the "pattern" for each pixel, while the internal state during processing is shown on the right).
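
As a sanity check on this counting argument, here is a minimal Python sketch (illustrative only, not the HLS code from this PR) that enumerates the per-pixel access patterns of a 3x3, stride-1 valid convolution and confirms that any input of at least 5x5 produces exactly 25 unique patterns:

```python
def pixel_patterns(height, width, k=3):
    """Distinct per-pixel access patterns of a k x k, stride-1 valid conv."""
    patterns = set()
    for i in range(height):
        for j in range(width):
            # Output windows that contain pixel (i, j), and the kernel
            # offset the pixel occupies in each of those windows.
            rows = range(max(0, i - k + 1), min(height - k, i) + 1)
            cols = range(max(0, j - k + 1), min(width - k, j) + 1)
            patterns.add(frozenset(
                (i - oi, j - oj) for oi in rows for oj in cols
            ))
    return patterns

# Any input of at least 5x5 exhibits the same 25 access patterns,
# so 25 "instructions" cover every input size.
assert len(pixel_patterns(5, 5)) == 25
assert len(pixel_patterns(32, 32)) == 25
```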

To use it, just switch IOType to io_stream. This can be done through the API (convert_from_keras_model) or in the yml configuration.
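
A minimal sketch of the API route (the toy model and output_dir are placeholders, not part of this PR):

```python
import hls4ml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

# Toy model; any model built from the layers listed below would do.
model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    Flatten(),
    Dense(10, activation='softmax'),
])

config = hls4ml.utils.config_from_keras_model(model, granularity='model')
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    io_type='io_stream',        # select the streaming implementation
    output_dir='my-hls-test',   # placeholder output directory
)
```

In the yml configuration, the equivalent is setting IOType: io_stream.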

Implemented layers:

  • Conv1D, Conv2D
    • same padding is implemented by inserting a ZeroPadding2D layer before Conv2D (same for 1D); see the sketch after this list
    • 1x1 Conv2D is handled in a special way, as is pointwise Conv1D
  • SeparableConv1D, SeparableConv2D, DepthwiseConv2D
  • MaxPooling1D, MaxPooling2D, AveragePooling1D, AveragePooling2D
  • GlobalMaxPooling2D, GlobalAveragePooling1D
  • Dense
    • Including QKeras variants
  • BatchNormalization
  • ReLU and Softmax activations
  • UpSampling2D
    • Only nearest interpolation is supported
  • Add, Subtract, Multiply, Average, Maximum and Minimum

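To illustrate the same-padding rewrite mentioned under Conv1D/Conv2D above, here is a small Keras check (not code from this PR; shapes and sizes are arbitrary) showing that a 'same'-padded Conv2D is equivalent to an explicit ZeroPadding2D followed by a 'valid' Conv2D, which is the form the converter produces:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, ZeroPadding2D

kernel_init = 'ones'  # fixed weights so both models compute the same thing

same = Sequential([
    Conv2D(4, (3, 3), padding='same', kernel_initializer=kernel_init,
           use_bias=False, input_shape=(8, 8, 1)),
])
padded_valid = Sequential([
    ZeroPadding2D(padding=1, input_shape=(8, 8, 1)),
    Conv2D(4, (3, 3), padding='valid', kernel_initializer=kernel_init,
           use_bias=False),
])

x = np.random.rand(1, 8, 8, 1).astype('float32')
assert np.allclose(same.predict(x), padded_valid.predict(x))
```
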
Conv1D/2D and Dense leverage the existing dense HLS implementation, so ReuseFactor and Strategy work.
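
For example, continuing the conversion sketch above (the values here are illustrative):

```python
import hls4ml

# 'model' is the Keras model from the sketch above; the keys follow
# the standard hls4ml configuration layout.
config = hls4ml.utils.config_from_keras_model(model, granularity='model')
config['Model']['ReuseFactor'] = 4          # trade latency for resources
config['Model']['Strategy'] = 'Resource'    # or 'Latency'

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, io_type='io_stream', output_dir='my-hls-test'
)
```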

Also included are dozens of small improvements to the internal model and the overall codebase.

Current limitations:

  • Only square kernels are supported (e.g., 2x2, 3x3, ...)
  • The maximum stride is the kernel width (i.e., you can't skip over pixels), except for 1x1 (pointwise) convolutions
  • No grouped or dilated convolutions

An earlier version of this implementation introduced PackingFactor to cram more pixels into a single element of the stream. While this helps process the input faster, it caused many issues when the input and output packing were mismatched and significantly increased LUT/FF usage, so it has been abandoned (for now). We are considering tiling as an alternative.

This is feature-complete and has been extensively tested by the CERN team, so I consider it fairly stable and ready for review. Some very minor changes and bugfixes may still come.

Questions, comments, praise and criticism are welcome!
