Generic Fused Optimizer
Bagua provides a generic fused optimizer that improves optimizer performance by fusing the `.step()` operation across multiple layers into a single update. It can be applied to any PyTorch optimizer.
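To illustrate the idea, here is a minimal conceptual sketch (not Bagua's actual implementation): instead of launching one small update kernel per layer, the per-layer parameters are laid out in a contiguous buffer so that a single elementwise operation updates all layers at once. The tensor shapes and the plain SGD update rule below are illustrative assumptions.

```python
import torch

# Illustrative per-layer parameters and gradients (shapes are arbitrary).
params = [torch.randn(256) for _ in range(100)]
grads = [torch.randn(256) for _ in range(100)]
lr = 0.01

# Unfused update: one small kernel launch per layer.
for p, g in zip(params, grads):
    p.add_(g, alpha=-lr)

# Fused update: a single kernel launch over one contiguous buffer.
# (In a real fused optimizer the per-layer tensors are views into this
# buffer, so the update is reflected in the original parameters;
# torch.cat merely copies here for illustration.)
flat_p = torch.cat([p.view(-1) for p in params])
flat_g = torch.cat([g.view(-1) for g in grads])
flat_p.add_(flat_g, alpha=-lr)
```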
Usage
To use the fused optimizer with an existing PyTorch optimizer:
- convert the optimizer to a fused optimizer with `bagua.contrib.fuse_optimizer()`,
- perform the optimizer step with the `.fuse_step()` method.
Here is an example of using the fused optimizer with the popular custom BertAdam optimizer (any other PyTorch optimizer, including the built-in ones, works as well):
```python
from pytorch_pretrained_bert.optimization import BertAdam
import bagua.torch_api as bagua

optimizer = BertAdam(parameters, ...)
optimizer = bagua.contrib.fuse_optimizer(optimizer, do_flatten=True)

for batch, input in enumerate(trainloader):
    model.zero_grad()
    ...
    optimizer.fuse_step()
```
See the API documentation for more details.
Integration with Bagua distributed communication algorithms
Both the fused optimizer and the Bagua module flatten parameter tensors by default, i.e. they reset the parameters' storage pointers to contiguous flattened tensors. To perform a more effective fused update, it is recommended to disable tensor flattening in `with_bagua` by setting its `do_flatten` parameter to `False`:

```python
model = model.with_bagua([optimizer], do_flatten=False)
```
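For reference, a minimal end-to-end sketch of this integration is shown below. The toy model, the SGD optimizer, and the GradientAllReduce algorithm choice are illustrative assumptions; the key points are `fuse_optimizer(..., do_flatten=True)` together with `with_bagua(..., do_flatten=False)`.

```python
import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

# Initialize Bagua's distributed context (assumes the process was started by
# bagua.distributed.launch or an equivalent launcher that sets the local rank).
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

# Toy model and optimizer for illustration.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Let the fused optimizer own the flattened parameter storage ...
optimizer = bagua.contrib.fuse_optimizer(optimizer, do_flatten=True)

# ... and keep with_bagua from re-flattening it.
model = model.with_bagua(
    [optimizer],
    gradient_allreduce.GradientAllReduceAlgorithm(),
    do_flatten=False,
)

for _ in range(10):
    x = torch.randn(32, 128).cuda()
    loss = model(x).sum()
    loss.backward()
    optimizer.fuse_step()
    optimizer.zero_grad()
```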
Benchmark
The benchmark is conducted on a BERT-Base model with 8 NVIDIA Tesla V100 GPUs. The results show that enabling the generic fused optimizer reduces the end-to-end training time by about 8%.
| | w/o Fused Optimizer | w/ Fused Optimizer |
|---|---|---|
| Epoch Time (s) | 3324 | 3055 |
| Accuracy | 0.9254 | 0.9260 |