Unrolled CNN implementation
Created by: vloncar
Description
This is a refined version of the Conv1D/2D implementation that unrolls the input feature matrix of the im2col algorithm for the io_parallel implementation. The general idea is to generate the im2col transformation code with exact instructions for each layer instead of synthesizing a generic C++ function, because the HLS compiler has trouble with the generic version. With this implementation, I was able to synthesize layers with <= 4096 elements (the usual partitioning limit). The old implementations had trouble with far smaller layers.
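To illustrate the idea of per-layer generated code, here is a minimal Python sketch that emits one explicit assignment per im2col element for a small Conv1D layer. It is not the generator used in this PR: the function name `generate_unrolled_im2col_1d`, the `data_col_N` naming, the channels-last flat layout, and the lack of padding are all assumptions made for the example.

```python
def generate_unrolled_im2col_1d(in_width, n_chan, kernel, stride=1):
    """Emit one C++-style assignment per im2col element of a Conv1D layer.

    Hypothetical helper: assumes no padding and a flat channels-last input,
    i.e. data[position * n_chan + channel].
    """
    out_width = (in_width - kernel) // stride + 1
    lines = []
    for out_pos in range(out_width):
        lines.append(f"// output pixel {out_pos}")
        col = 0
        for k in range(kernel):
            for c in range(n_chan):
                # Exact source index, resolved at generation time rather
                # than computed by loop logic inside the HLS function.
                src = (out_pos * stride + k) * n_chan + c
                lines.append(f"data_col_{out_pos}[{col}] = data[{src}];")
                col += 1
    return lines


code = generate_unrolled_im2col_1d(in_width=4, n_chan=2, kernel=3)
print("\n".join(code))
```

Every index is a literal in the emitted code, so the compiler sees plain wiring instead of nested loops with computed addresses.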
Based on the unrolled im2col step, the implementation further uses an adapted matrix-vector multiplication for the Resource or Latency strategy. Note that using the Latency strategy for the overall model won't work, as that would pipeline the entire design and cause all the loops to be unrolled, which breaks synthesis. Therefore, selecting the Latency strategy for the model will issue a warning and switch to the Resource strategy (aka "dataflow"). Individual layers may still use the Latency strategy.
A new tuning knob, ParallelizationFactor, is introduced to be combined with ReuseFactor to control the amount of parallelism. It controls the number of output pixels processed in parallel and defaults to 1, implying no parallelization. Valid values are divisors of out_height * out_width, and hls4ml will warn if an incorrect ParallelizationFactor is used.
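As a quick way to pick a value, the valid settings can be enumerated from the output shape. This is a small standalone sketch; the helper name `valid_parallelization_factors` is made up for the example and is not part of hls4ml. For Conv1D, pass `out_height=1`.

```python
def valid_parallelization_factors(out_height, out_width):
    """Return all divisors of out_height * out_width, in ascending order."""
    n_pixels = out_height * out_width
    return [pf for pf in range(1, n_pixels + 1) if n_pixels % pf == 0]


factors = valid_parallelization_factors(out_height=4, out_width=6)
print(factors)  # [1, 2, 3, 4, 6, 8, 12, 24]
```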
One feature of this implementation that wasn't part of the original implementation from last year is the predictable II. In general, for the Resource strategy, II = (ReuseFactor + C) * out_height * out_width / ParallelizationFactor + 1, where C is ~4. For the Latency strategy, C is 1-2. The +1 is for the function call itself.
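The formula is easy to evaluate when exploring settings. The sketch below is only an estimate under the stated approximation: the function name `estimate_ii` is hypothetical, and `c` defaults to 4 because the description gives C as roughly 4 for the Resource strategy.

```python
def estimate_ii(reuse_factor, out_height, out_width,
                parallelization_factor=1, c=4):
    """Estimate II = (ReuseFactor + C) * out_h * out_w / PF + 1.

    c is approximate: about 4 for Resource, 1-2 for Latency.
    """
    n_pixels = out_height * out_width
    if n_pixels % parallelization_factor != 0:
        raise ValueError("ParallelizationFactor must divide out_height * out_width")
    # The division is exact because the factor divides the pixel count.
    return (reuse_factor + c) * n_pixels // parallelization_factor + 1


print(estimate_ii(4, 8, 8))                            # 513
print(estimate_ii(4, 8, 8, parallelization_factor=4))  # 129
print(estimate_ii(1, 8, 8, c=2))                       # 193 (Latency, C = 2)
```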
This only touches the base Conv1D/2D layers; SeparableConv1D/2D will come in a later PR. Whether PointwiseConv1D/2D should be a special case at all with this implementation needs investigation.
Limitations:
in_height * in_width * n_chan <= 4096
out_height * out_width * n_filt <= 4096
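A layer can be checked against these two limits before attempting synthesis. This is an illustrative standalone check; `check_conv_limits` and `PARTITION_LIMIT` are names chosen for the example, not hls4ml identifiers. For Conv1D, use a height of 1.

```python
PARTITION_LIMIT = 4096  # the usual partitioning limit quoted above


def check_conv_limits(in_height, in_width, n_chan,
                      out_height, out_width, n_filt):
    """Return True if both input and output sizes fit within the limit."""
    in_elems = in_height * in_width * n_chan
    out_elems = out_height * out_width * n_filt
    return in_elems <= PARTITION_LIMIT and out_elems <= PARTITION_LIMIT


print(check_conv_limits(16, 16, 16, 14, 14, 16))  # True: 4096 in, 3136 out
print(check_conv_limits(32, 32, 3, 30, 30, 8))    # False: 7200 output elements
```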
In order to wire all this together, the core of the layers had to be extended. A new type of attribute, Source, is introduced, representing generated source code. Layers can have any number of generated sources. The writer can pick up this information.
Type of change
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
Breaking in the sense that it replaces the previous implementations and slightly changes the mechanics of how the strategy works.
Tests
The existing tests confirm the accuracy of the implementation.
Test Configuration:
Run any Conv1D/2D tests, just ensure io_parallel is used. Play with ParallelizationFactor and ReuseFactor as desired. Don't forget the limitations above!
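One way to explore the two knobs systematically is to build a sweep of configuration dictionaries. The sketch below only builds plain Python dicts and does not call hls4ml. The `Model` / `LayerName` nesting is assumed to mirror hls4ml's dictionary-style configuration, the layer name `conv1` is hypothetical, and other required settings (such as precision, and the io_parallel choice made at conversion time) are left out.

```python
from itertools import product


def sweep_configs(layer_name, out_height, out_width, reuse_factors):
    """Build one config fragment per (ReuseFactor, ParallelizationFactor) pair.

    Only ParallelizationFactor values that divide the number of output
    pixels are generated.
    """
    n_pixels = out_height * out_width
    pfs = [pf for pf in range(1, n_pixels + 1) if n_pixels % pf == 0]
    configs = []
    for rf, pf in product(reuse_factors, pfs):
        configs.append({
            # Model-level Latency would be switched to Resource anyway.
            "Model": {"Strategy": "Resource", "ReuseFactor": rf},
            "LayerName": {layer_name: {"ParallelizationFactor": pf}},
        })
    return configs


configs = sweep_configs("conv1", out_height=2, out_width=2, reuse_factors=[1, 4])
print(len(configs))  # 6 combinations: 2 reuse factors x 3 divisors of 4
```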
Checklist
- I have read the guidelines for contributing.
- I have commented my code, particularly in hard-to-understand areas.
- I have made corresponding changes to the documentation.
- My changes generate no new warnings.
- I have added tests that prove my fix is effective or that my feature works.