WIP: First attempt at RNN
Created by: ejk43
I have some (very basic) functionality working for an RNN. Thought I would go ahead and generate a Work-In-Progress PR for discussion.
Current status:
- Added `nnet_recursive.h` file for RNN / LSTM functions
- Added `simple_rnn` function based on the Keras naming convention ("SimpleRNN"??). Not sure this is the correct usage of the SimpleRNN name, so feel free to change as needed. The existing `simple_rnn` represents the "single" RNN cell (https://github.com/pangolulu/rnn-from-scratch and http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
- Added example-prj `rnn-test` to exercise floating-point functionality of the `simple_rnn` function
So, the `rnn-test` project compiles and displays some text output that I used to verify the dot-product implementations and static state-variable storage.
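For reference, here is roughly the math the cell implements, following the equations in the links above: s(t) = tanh(U·x(t) + W·s(t-1)), o(t) = V·s(t), with the softmax omitted. This is just a plain floating-point sketch for discussion -- the sizes (`N_IN`, `N_STATE`, `N_OUT`), the weight names (`U`, `W`, `V`), and `simple_rnn_sketch` itself are placeholders rather than the actual interface in `nnet_recursive.h`, and it uses the reference-equation index ordering for readability (the PR code itself follows the `compute_layer` convention, as discussed in the notes below):

```cpp
#include <math.h>

// Placeholder sizes -- not the template parameters used in nnet_recursive.h
#define N_IN     4   // input features per timestep
#define N_STATE  8   // hidden-state width
#define N_OUT    3   // output width

// One RNN "cell" evaluation: s(t) = tanh(U*x(t) + W*s(t-1)), o(t) = V*s(t).
// The hidden state lives in a static array across calls, matching the
// static state-variable storage mentioned above.
void simple_rnn_sketch(float x[N_IN], float o[N_OUT],
                       float U[N_STATE][N_IN],
                       float W[N_STATE][N_STATE],
                       float V[N_OUT][N_STATE])
{
    static float s[N_STATE] = {0};

    // s(t) = tanh(U*x(t) + W*s(t-1))
    float s_next[N_STATE];
    for (int i = 0; i < N_STATE; i++) {
        float acc = 0;
        for (int j = 0; j < N_IN; j++)    acc += U[i][j] * x[j];   // input contribution
        for (int j = 0; j < N_STATE; j++) acc += W[i][j] * s[j];   // recurrent contribution
        s_next[i] = tanhf(acc);
    }
    for (int i = 0; i < N_STATE; i++) s[i] = s_next[i];

    // o(t) = V*s(t)  (softmax left out for brevity)
    for (int i = 0; i < N_OUT; i++) {
        float acc = 0;
        for (int j = 0; j < N_STATE; j++) acc += V[i][j] * s[j];
        o[i] = acc;
    }
}
```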
A few notes...
- I did not even begin to synthesize the function yet -- I just confirmed that the `simple_rnn` function compiles and that the algorithm is implemented correctly on a nominal test case. I'm using no compiler directives and floating-point calculations only... Not sure what the synthesis results will be, but probably not pretty :)
- I modeled the dot products on the staged multiply/accumulate approach of the fully-connected `compute_layer`. It appears the `compute_layer` function was implemented with vector/matrix sizes as `[1xN]*[NxM]`, whereas the provided links demonstrating RNN functions use `[MxN]*[Nx1]`. This requires that the matrix values be transposed from what the reference equations dictate (a toy comparison of the two orderings is sketched just after this list). Do you have a preferred convention here? Should we flip to match the reference? Racking my memory, I think I may have at one point started implementing `compute_layer` with the conventional ordering, but found the array/BRAM partitioning and access patterns to be more natural with the opposite order. That said, I'd be open to changing it now; the reversed order is somewhat confusing.
- The `simple_rnn` function is designed to be a single "cell" of the RNN, which is then called multiple times with new input values. Looking at `rnn-test/firmware/myproject.cpp`, you can see the for-loop wrapping the `simple_rnn` call (a simplified version is sketched after this list). I'm not sure yet how we'd best handle re-calling `simple_rnn`... It's worth some thought, and it might turn out to be use-case specific. For example, it might make sense to move the for-loop into the `nnet_recursive.h` library when the RNN needs to iterate over a vector of features at a single timestep, as you demonstrated in the PowerPoint discussion a couple of weeks ago. On the other hand, if the RNN iterates over a vector of features across time (say, for speech recognition), then it might make sense to leave the for-loop iteration outside the `simple_rnn` call. Any thoughts on this level of architecture?
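On the ordering question, here is a toy comparison of the two conventions (reusing the placeholder sizes from the sketch above; the function names `mat_vec_1xN_NxM` / `mat_vec_MxN_Nx1` are made up for illustration). Both produce the same result -- the difference is purely in how the weight array is stored and indexed:

```cpp
// compute_layer-style ordering: [1 x N_IN] * [N_IN x N_OUT],
// with the weights stored transposed relative to the reference equations.
void mat_vec_1xN_NxM(float data[N_IN], float weights_T[N_IN][N_OUT], float res[N_OUT])
{
    for (int j = 0; j < N_OUT; j++) {
        float acc = 0;
        for (int i = 0; i < N_IN; i++)
            acc += data[i] * weights_T[i][j];
        res[j] = acc;
    }
}

// Reference-equation ordering: [N_OUT x N_IN] * [N_IN x 1],
// with the weights stored exactly as written in the equations.
void mat_vec_MxN_Nx1(float weights[N_OUT][N_IN], float data[N_IN], float res[N_OUT])
{
    for (int j = 0; j < N_OUT; j++) {
        float acc = 0;
        for (int i = 0; i < N_IN; i++)
            acc += weights[j][i] * data[i];
        res[j] = acc;
    }
}
```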
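And here is a simplified version of the caller-side iteration for the time-series case. This is not the literal contents of `rnn-test/firmware/myproject.cpp` -- `N_TIMESTEPS`, `process_sequence`, and `simple_rnn_sketch` are the placeholders from the earlier sketch:

```cpp
#define N_TIMESTEPS 8   // placeholder sequence length

void process_sequence(float x_seq[N_TIMESTEPS][N_IN], float o_seq[N_TIMESTEPS][N_OUT],
                      float U[N_STATE][N_IN], float W[N_STATE][N_STATE],
                      float V[N_OUT][N_STATE])
{
    // Each call advances the recursion by one timestep; the hidden state is
    // carried inside simple_rnn_sketch via its static array.
    for (int t = 0; t < N_TIMESTEPS; t++) {
        simple_rnn_sketch(x_seq[t], o_seq[t], U, W, V);
    }
}
```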
As an interesting aside, it appears that all three matrix multiplies in the RNN may be performed simultaneously in fabric, since no multiplication relies on the output of a previous multiplication within the same timestep (assuming, of course, you can accept a one-cycle delay on the output value, i.e. o(t) becomes o(t+1) and o(0) = 0). I'm not sure yet how to set up the compiler directives to indicate this behavior... Perhaps pipeline will take care of it all just fine.
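To make that aside concrete, one way to arrange the cell body so that none of the three matrix-vector products feeds another within a timestep (again just a sketch reusing the placeholder names and includes above; the directive shown is a speculative, untested guess -- function-level PIPELINE alone may indeed be sufficient):

```cpp
// Variant of the cell with the output taken from the *previous* state,
// i.e. o(t) becomes o(t+1) with o(0) = 0, so the three matrix-vector
// products are mutually independent within a timestep.
void simple_rnn_parallel_sketch(float x[N_IN], float o[N_OUT],
                                float U[N_STATE][N_IN],
                                float W[N_STATE][N_STATE],
                                float V[N_OUT][N_STATE])
{
#pragma HLS PIPELINE   // speculative: may already let the scheduler overlap the products
    static float s[N_STATE] = {0};

    // o(t+1) = V * s(t): depends only on last timestep's state, so it can be
    // scheduled alongside the state update below.
    for (int i = 0; i < N_OUT; i++) {
        float acc = 0;
        for (int j = 0; j < N_STATE; j++) acc += V[i][j] * s[j];
        o[i] = acc;
    }

    // s(t+1) = tanh(U*x(t) + W*s(t)): the U*x and W*s products are also
    // independent of each other and of the output product above.
    float s_next[N_STATE];
    for (int i = 0; i < N_STATE; i++) {
        float acc = 0;
        for (int j = 0; j < N_IN; j++)    acc += U[i][j] * x[j];
        for (int j = 0; j < N_STATE; j++) acc += W[i][j] * s[j];
        s_next[i] = tanhf(acc);
    }
    for (int i = 0; i < N_STATE; i++) s[i] = s_next[i];
}
```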
Hopefully this can help provide a starting point, or at least some inspiration. I'll try to look into the conversion to fixed point and the associated compiler directives, which I'm expecting to be fairly challenging to get right here.