Force allreduce of all gradients in step(), bugfixes
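The "force allreduce in step()" behavior can be illustrated with a minimal single-process sketch (all names hypothetical, and a local callable stands in for the real asynchronous MPI/NCCL allreduce): before applying updates, step() starts an allreduce for any gradient that does not already have one in flight, then waits on all of them, so parameters whose gradient hooks never fired still get reduced.

```python
class DistributedOptimizerSketch:
    """Toy stand-in for a distributed optimizer that forces an
    allreduce of every gradient in step(), even gradients whose
    backward hooks never fired (e.g. unused parameters)."""

    def __init__(self, params, lr=0.1, world_size=1):
        self.params = params          # list of dicts: {"value": x, "grad": g}
        self.lr = lr
        self.world_size = world_size
        self._handles = {}            # param index -> pending allreduce "handle"

    def _allreduce_async(self, grad):
        # Single-process stand-in: a real implementation would post an
        # asynchronous allreduce and return a handle to wait on later.
        return lambda: grad / self.world_size

    def step(self):
        # The fix sketched here: start an allreduce for any gradient
        # without one in flight, then wait on all handles before updating.
        for i, p in enumerate(self.params):
            if i not in self._handles:
                self._handles[i] = self._allreduce_async(p["grad"])
        for i, p in enumerate(self.params):
            p["grad"] = self._handles[i]()   # wait for the averaged gradient
            p["value"] -= self.lr * p["grad"]
        self._handles.clear()
```

With one parameter (value 1.0, gradient 0.5) and lr=0.1, a single step() moves the value to 0.95 and leaves no pending handles behind.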
Add compatibility with PyTorch 0.4.1
Support for IBM PowerAI DDL & APIs to restore optimizer state
Critical Bugfix: PyTorch must wait for GPU data before allreduce
Critical Bugfix: non-fused allreduce produces incorrect results
Fix PyTorch master break - use proper THTensor_storage() API
Bugfix mpi4py: Create a private MPI communicator
Collective control plane & other low latency improvements
Bugfix: `python setup.py install` requires cffi
Support TensorFlow Optimizers that override compute_gradients/apply_gradients
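The idea behind that change can be sketched framework-free (class and function names here are hypothetical, not the real Horovod or TensorFlow API): the distributed wrapper delegates to the wrapped optimizer's *own* compute_gradients/apply_gradients, so any subclass overrides are honored, and only inserts the allreduce between the two calls.

```python
class BaseOptimizer:
    def compute_gradients(self, loss_grads):
        return loss_grads

    def apply_gradients(self, grads, params):
        return [p - 0.1 * g for p, g in zip(params, grads)]


class ClippingOptimizer(BaseOptimizer):
    # A subclass that overrides compute_gradients, e.g. to clip gradients.
    def compute_gradients(self, loss_grads):
        return [max(min(g, 1.0), -1.0) for g in loss_grads]


def allreduce_mean(grads, world_size=1):
    # Single-process stand-in for an MPI/NCCL allreduce-average.
    return [g / world_size for g in grads]


class DistributedWrapperSketch:
    def __init__(self, opt, world_size=1):
        self._opt = opt
        self._world_size = world_size

    def compute_gradients(self, loss_grads):
        # Delegate to the wrapped optimizer so overrides take effect,
        # then average the resulting gradients across workers.
        grads = self._opt.compute_gradients(loss_grads)
        return allreduce_mean(grads, self._world_size)

    def apply_gradients(self, grads, params):
        # Likewise delegate apply_gradients to the wrapped optimizer.
        return self._opt.apply_gradients(grads, params)
```

Wrapping `ClippingOptimizer` this way clips a raw gradient of 5.0 down to 1.0 before the allreduce, which is exactly the behavior a wrapper that re-implemented compute_gradients from the base class would silently lose.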
Avoid deadlock if worker crashes with an exception
New major release
Add Keras ImageNet training example and LearningRateScheduleCallback
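A minimal framework-free sketch of what such a schedule callback does (class name and signature are hypothetical, not the real Horovod Keras callback): at the start of each epoch in a given range, scale the base learning rate by an epoch-dependent multiplier.

```python
class LRScheduleSketch:
    """Applies base_lr * multiplier(epoch) while
    start_epoch <= epoch < end_epoch (end_epoch=None means open-ended)."""

    def __init__(self, base_lr, multiplier, start_epoch=0, end_epoch=None):
        self.base_lr = base_lr
        self.multiplier = multiplier      # callable: epoch -> float
        self.start_epoch = start_epoch
        self.end_epoch = end_epoch
        self.lr = base_lr

    def on_epoch_begin(self, epoch):
        in_range = epoch >= self.start_epoch and (
            self.end_epoch is None or epoch < self.end_epoch)
        if in_range:
            self.lr = self.base_lr * self.multiplier(epoch)
        return self.lr
```

For example, a step-decay schedule that divides the rate by 10 every 30 epochs (a common ImageNet-style choice) would be `LRScheduleSketch(0.1, lambda e: 0.1 ** (e // 30))`, yielding 0.1 at epoch 0 and 0.01 at epoch 30.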