Implement Serial mode directives in Dense Layer
Created by: ejk43
I saw updates to the recent serial mode PR (https://github.com/hls-fpga-machine-learning/hls4ml/pull/45) -- Looks good!!
I noticed that the top-level IOs were still fully partitioned, and that the dense layer ended up fully serialized -- since essentially no directives are applied in serial mode, HLS defaults to serializing everything. That's a conservative approach, but generally not very useful from a latency/throughput standpoint :)
This PR adds a few edits that make the dense layer in nnet_layer.h genuinely useful in serial mode! I'm pretty happy with how it turned out, and it's pretty close to what I achieved last year in my original library. I think it should be mergeable -- please review and provide feedback otherwise.
Quick overview of edits:
- Removes all ARRAY_RESHAPE / ARRAY_PARTITION directives from the top level. This lets the top-level IOs remain serial inputs and outputs. I also converted the IOs to "handshake" type (ap_hs).
- Adds the DATAFLOW directive to the top level, plus corresponding HLS STREAM directives indicating that the inter-layer variables should be implemented as streaming FIFOs instead of ping-pong RAM (https://www.xilinx.com/html_docs/xilinx2017_2/sdaccel_doc/topics/pragmas/ref-pragma_HLS_stream.html)
- Sets up DATAFLOW, PIPELINE, and ARRAY_PARTITION directives in the dense layer to achieve a "good" balance of resources vs latency for serial mode, including a reasonable usage of the reuse_factor.
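The top-level restructuring above can be sketched roughly as follows. This is a minimal, compilable C++ sketch, *not* the actual hls4ml generated code -- the layer body, `data_t`, and the sizes are placeholders; the pragma placement is the point:

```cpp
#include <cassert>

typedef float data_t;   // stand-in for the real fixed-point type
const int N_IN  = 10;
const int N_OUT = 32;

// Placeholder layer body so the sketch runs; the real dense layer
// lives in nnet_layer.h.
static void dense(const data_t in[N_IN], data_t out[N_OUT]) {
    for (int i = 0; i < N_OUT; i++) out[i] = in[i % N_IN];
}

void myproject(data_t data[N_IN], data_t res[N_OUT]) {
    // No ARRAY_RESHAPE / ARRAY_PARTITION on the IOs -- they stay serial,
    // and the handshake protocol makes them ap_hs ports.
    #pragma HLS INTERFACE ap_hs port=data
    #pragma HLS INTERFACE ap_hs port=res
    // Let the layers run as concurrent dataflow processes.
    #pragma HLS DATAFLOW

    data_t layer1_out[N_OUT];
    // STREAM turns the inter-layer buffer into a FIFO rather than
    // a ping-pong RAM.
    #pragma HLS STREAM variable=layer1_out depth=1

    dense(data, layer1_out);
    for (int i = 0; i < N_OUT; i++) res[i] = layer1_out[i];
}
```

(The `#pragma HLS` lines are no-ops outside Vivado HLS, so the sketch compiles anywhere.)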
Here's a quick overview of my assumptions/expectations for serial-mode:
- The number of multipliers in the dense layer will be driven by the N_OUT parameter; that is, the matrix-multiply for-loop will be pipelined (by default) such that there are N_OUT parallelized multipliers.
- The dense layer will be able to accept one new input per clock, perform the matrix multiply, and then clock out one output value at a time. The II for the serialized dense layer should be nominally "close" to the maximum of N_IN and N_OUT (I do see higher IIs, likely due to other glue logic)
- Adding a reuse_factor reduces the number of multipliers required for the dense layer, at the expense of a larger II into the matrix multiply. This does not necessarily mean the total II is higher for the dense layer: for example, the 10x32 dense layer in the basic example keeps the same total II with reuse_factor = 2 or 3, because clocking in the 10 inputs then takes 20 or 30 cycles rather than 10, while the total II was already > 32 due to the output size.
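To illustrate the pipelining/partitioning scheme described above, here's a hedged sketch of a serial-mode dense kernel. It is a simplified stand-in for the real nnet_layer.h code -- `dense_serial`, the `float` types, and the concrete partition factor are all illustrative:

```cpp
#include <cassert>

const int N_IN = 10, N_OUT = 32, REUSE_FACTOR = 2;
typedef float weight_t;  // stand-ins for the real fixed-point types
typedef float accum_t;

void dense_serial(const float data[N_IN],
                  const weight_t weights[N_IN][N_OUT],
                  const weight_t biases[N_OUT],
                  float res[N_OUT]) {
    accum_t acc[N_OUT];
    // Fully partition the accumulators so the MACs can update in parallel;
    // cyclically partition the weights to match the multiplier banking
    // (16 = N_OUT / REUSE_FACTOR for these illustrative sizes).
    #pragma HLS ARRAY_PARTITION variable=acc complete
    #pragma HLS ARRAY_PARTITION variable=weights cyclic factor=16 dim=2

    for (int j = 0; j < N_OUT; j++) acc[j] = biases[j];

    InputLoop: for (int i = 0; i < N_IN; i++) {
        // II = REUSE_FACTOR: each multiplier is time-shared REUSE_FACTOR
        // times per input, so ~N_OUT/REUSE_FACTOR multipliers are inferred
        // instead of N_OUT, and clocking in the N_IN inputs takes
        // ~N_IN * REUSE_FACTOR cycles.
        #pragma HLS PIPELINE II=REUSE_FACTOR
        for (int j = 0; j < N_OUT; j++)
            acc[j] += data[i] * weights[i][j];
    }

    OutputLoop: for (int j = 0; j < N_OUT; j++) {
        // One output value per clock.
        #pragma HLS PIPELINE II=1
        res[j] = acc[j];
    }
}
```

With the 10x32 example sizes, the input loop takes ~10*REUSE_FACTOR cycles and the output loop ~32, which is why the total II stays output-bound until reuse_factor grows past ~3.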
Utilization result from the basic dense-layer example:
```
* Summary:
+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|DSP              |        -|      -|       -|       -|
|Expression       |        -|      -|       -|       -|
|FIFO             |        0|      -|      15|      90|
|Instance         |       17|     33|    1781|    3749|
|Memory           |        -|      -|       -|       -|
|Multiplexer      |        -|      -|       -|       -|
|Register         |        -|      -|       9|       -|
+-----------------+---------+-------+--------+--------+
|Total            |       17|     33|    1805|    3839|
+-----------------+---------+-------+--------+--------+
|Available        |     2940|   3600|  866400|  433200|
+-----------------+---------+-------+--------+--------+
|Utilization (%)  |       ~0|     ~0|      ~0|      ~0|
+-----------------+---------+-------+--------+--------+
```
And the ports show only 18-bit interfaces to data and res:
```
+-----------------------+-----+-----+------------+----------------+--------------+
|       RTL Ports       | Dir | Bits|  Protocol  | Source Object  |    C Type    |
+-----------------------+-----+-----+------------+----------------+--------------+
|data_V                 |  in |   18|       ap_hs|data_V          |pointer       |
|data_V_ap_vld          |  in |    1|       ap_hs|data_V          |pointer       |
|data_V_ap_ack          | out |    1|       ap_hs|data_V          |pointer       |
|res_V                  | out |   18|       ap_hs|res_V           |pointer       |
|res_V_ap_vld           | out |    1|       ap_hs|res_V           |pointer       |
|res_V_ap_ack           |  in |    1|       ap_hs|res_V           |pointer       |
|const_size_in          | out |   16|      ap_vld|const_size_in   |pointer       |
|const_size_in_ap_vld   | out |    1|      ap_vld|const_size_in   |pointer       |
|const_size_out         | out |   16|      ap_vld|const_size_out  |pointer       |
|const_size_out_ap_vld  | out |    1|      ap_vld|const_size_out  |pointer       |
|ap_clk                 |  in |    1|  ap_ctrl_hs|myproject       |return value  |
|ap_rst                 |  in |    1|  ap_ctrl_hs|myproject       |return value  |
|ap_start               |  in |    1|  ap_ctrl_hs|myproject       |return value  |
|ap_done                | out |    1|  ap_ctrl_hs|myproject       |return value  |
|ap_idle                | out |    1|  ap_ctrl_hs|myproject       |return value  |
|ap_ready               | out |    1|  ap_ctrl_hs|myproject       |return value  |
+-----------------------+-----+-----+------------+----------------+--------------+
```
Timing and Latency also look nice:
```
+ Timing (ns):
* Summary:
+--------+-------+----------+------------+
|  Clock | Target| Estimated| Uncertainty|
+--------+-------+----------+------------+
|ap_clk  |   5.00|      4.14|        0.62|
+--------+-------+----------+------------+

+ Latency (clock cycles):
* Summary:
+-----+-----+-----+-----+----------+
|  Latency  |  Interval | Pipeline |
| min | max | min | max |   Type   |
+-----+-----+-----+-----+----------+
|   48|   48|   38|   38| dataflow |
+-----+-----+-----+-----+----------+
```
Thoughts??