Stable softmax
Created by: thesps
I have a bit more work on the Softmax implementation. You may recall there was a new version introduced with PR #195. In issue #204 (closed), it was observed not to be working that well, possibly producing unphysical values (outside the range [0, 1]) and giving incorrect predictions.
After some more experiments, I've done a few things:
- Changed the default type (when using `hls4ml.util.config`) to include the `AP_RND`, `AP_SAT` rounding and saturation modes. This bounds the values to the correct range (see the config sketch after this list).
- Added a new 'stable' implementation (documented below).
- Added a config parameter to select between implementations.
- Based on the above parameter, added back the implementation used until v0.2.0 (now called 'legacy').
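
As a hedged sketch of the bounded default type (the model file name, precision string, and exact config keys here are illustrative assumptions, not necessarily the shipped defaults):

```python
import hls4ml
from tensorflow.keras.models import load_model

# Illustrative only: AP_RND rounds and AP_SAT saturates, so the softmax table
# and output values stay inside the representable range instead of wrapping to
# unphysical values.
model = load_model('model.h5')             # hypothetical Keras model file
cfg = hls4ml.utils.config_from_keras_model(model, granularity='name')
cfg['LayerName']['softmax']['Precision'] = 'ap_fixed<16,6,AP_RND,AP_SAT>'
```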
The new 'stable' version uses the same type-agnostic lookup tables as the recent new implementation. For the vector `x`, Softmax does `y = exp(x) / sum(exp(x))`. In the code, lookup tables are used for `exp(x)` and `1 / exp(x)`. The new version does `y = exp(x - max(x)) / sum(exp(x - max(x)))`. This construction means that all the `x - max(x)` values are either `0` (when `x = max(x)`) or negative. This makes the lookup tables much nicer: the exponential table effectively saturates at the ~0 end rather than at the +infinity end, as it does when just looking up `exp(x)`.
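
To make the construction concrete, here is a minimal NumPy sketch of the same identity (illustration only, not the HLS code):

```python
import numpy as np

# Illustration only: the shifted and unshifted forms are mathematically equal,
# but the shifted one only ever exponentiates values <= 0, which is what makes
# a fixed-range exp lookup table behave nicely.
def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    e = np.exp(x - x.max())   # arguments are 0 (at the max) or negative
    return e / e.sum()

x = np.array([12.0, 3.0, -5.0])
print(np.allclose(softmax_naive(x), softmax_stable(x)))  # True
```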
On the jet tagging dataset, the version in master already performs well, and the new stable implementation performs similarly (just showing 2 classes for clarity):
The problem was seen in MNIST, where the accuracy is typically close to 100%. This means one class is typically predicted with a much greater probability than any other. Most of the time the implementation in `master` works fine, but due to the big range in input values, class predictions which are actually quite different sometimes get clipped to the same probability (~1). Then, when taking the `argmax` to predict and evaluate accuracy, the wrong prediction can be made. The new 'stable' implementation resolves these cases nicely.
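
As a toy sketch of how that clipping can produce a tie (this is an assumed mechanism for illustration only, not the actual HLS code): if both the exp lookup and the accumulated sum saturate at the same maximum, every large logit comes out as roughly exp_max * (1 / exp_max) ≈ 1.

```python
import numpy as np

# Assumed mechanism, for illustration only.
EXP_MAX = np.exp(8.0)                      # hypothetical table saturation point

def exp_lookup(x):
    return np.exp(np.minimum(x, 8.0))      # table saturates above x = 8

def softmax_clipped(x):
    e = exp_lookup(x)
    s = min(e.sum(), EXP_MAX)              # running sum saturates too
    return np.clip(e / s, 0.0, 1.0)        # output type saturates to [0, 1]

logits = np.array([12.0, 20.0, -3.0])      # class 1 should clearly win
probs = softmax_clipped(logits)            # -> [1.0, 1.0, ~0]: a tie at ~1
print(np.argmax(probs))                    # 0, i.e. the wrong class
```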
Here's the ROC just for one class:
It may look like it's still not doing that well, but the accuracy tells the story:
| Inference | Accuracy |
|---|---|
| Keras | 0.9774 |
| Master | 0.9720 |
| Legacy | 0.9689 |
| Stable | 0.9774 |
So the `argmax` prediction matches Keras every time, in this case. The ROC differs slightly because the hls4ml prediction doesn't carry quite as much precision as the Keras one.
In terms of latency & resources, the new stable version is very slightly slower than either the current master or 'legacy' versions, due to having to find `max(x)`: 6 cycles for those vs. 8 cycles for the stable version, in a standalone Softmax project for 10 classes (MNIST).
The resources are most similar to the current master implementation, so ~0% for all types, but slightly more LUTs and FFs are used. Both of these are smaller than in the legacy implementation.
So, based on the latency & resources, I've set the default implementation to be the one in current master (called `latency`).
The selection is done in the config like `cfg['LayerName']['softmax']['implementation'] = 'stable'` (options are `latency`, `stable`, `legacy`). We might want to make the default `stable` so it 'just works' in more cases, but for low latency applications the `latency` version is probably best.
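
For completeness, here is a hedged end-to-end sketch of selecting the implementation (the model file name is hypothetical, and the exact converter signature may differ between hls4ml versions):

```python
import hls4ml
from tensorflow.keras.models import load_model

model = load_model('mnist_model.h5')       # hypothetical Keras model file
cfg = hls4ml.utils.config_from_keras_model(model, granularity='name')

# 'softmax' is the Keras layer name; options: 'latency', 'stable', 'legacy'
cfg['LayerName']['softmax']['implementation'] = 'stable'

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=cfg, output_dir='softmax_test')
hls_model.compile()                        # build the C simulation library
```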