# Use MS-AMP
## Basic usage

Enabling MS-AMP is very simple when training a model without any distributed parallel technologies: you only need to add one line of code, `msamp.initialize(model, optimizer, opt_level)`, after defining the model and optimizer.
Example:
```python
import torch
import msamp

# Declare model and optimizer as usual, with default (FP32) precision
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Allow MS-AMP to perform casts as required by the opt_level
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
...
```
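After `msamp.initialize`, the training loop itself needs no changes; MS-AMP performs the low-precision casts internally. Continuing the snippet above, a minimal sketch of one training step (the random batch and `mse_loss` here are illustrative assumptions, not part of the MS-AMP API):

```python
# One ordinary training step; nothing below is MS-AMP-specific.
x = torch.randn(64, D_in).cuda()   # illustrative input batch
y = torch.randn(64, D_out).cuda()  # illustrative targets

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```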
## Usage in DeepSpeed

MS-AMP supports FP8 for distributed parallel training and can integrate with advanced distributed training frameworks. We have integrated MS-AMP with several popular distributed training frameworks, such as DeepSpeed, Megatron-DeepSpeed, and Megatron-LM, to demonstrate this capability.
To enable MS-AMP in DeepSpeed, add one line of code, `from msamp import deepspeed`, at the beginning of your script, and add an "msamp" section to the DeepSpeed config file:
"msamp": { "enabled": true, "opt_level": "O1|O2|O3", "use_te": false}
"O3" is designed for FP8 in ZeRO optimizer, so please make sure ZeRO is enabled when using "O3". "use_te" is designed for Transformer Engine, if you have already used Transformer Engine in your model, don't forget to set "use_te" to true.
## Usage in FSDP

When using FSDP, enabling MS-AMP is straightforward: use `FsdpReplacer.replace` and `FP8FullyShardedDataParallel` to initialize the model, and `FSDPAdam` to initialize the optimizer.
Example:
```python
# Your model
model = ...

# Initialize model
from msamp.fsdp import FsdpReplacer
from msamp.fsdp import FP8FullyShardedDataParallel

my_auto_wrap_policy = ...
model = FsdpReplacer.replace(model)
model = FP8FullyShardedDataParallel(model, use_orig_params=True, auto_wrap_policy=my_auto_wrap_policy)

# Initialize optimizer
from msamp.optim import FSDPAdam
optimizer = FSDPAdam(model.parameters(), lr=3e-04)
```
Please note that currently we only support `use_orig_params=True`.
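Once the model and optimizer are wrapped, the training loop itself needs no MS-AMP-specific changes. A minimal sketch of one step, assuming `torch.distributed` has been initialized (as FSDP requires) and using illustrative batch shapes:

```python
import torch

# Illustrative batch; the shapes are placeholders for your actual data.
inputs = torch.randn(8, 128).cuda()
targets = torch.randn(8, 10).cuda()

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()
optimizer.step()
```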
## Usage in Megatron-DeepSpeed and Megatron-LM

To integrate MS-AMP with Megatron-DeepSpeed and Megatron-LM, you need to make some code changes. We provide a patch as a reference for the integration. Here are the instructions for integrating MS-AMP with Megatron-DeepSpeed/Megatron-LM and for running GPT-3 with MS-AMP.
Runnable, simple examples demonstrating good practices can be found here. For more comprehensive examples, please go to MS-AMP-Examples.