1 System environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.1
Execution mode (PyNative/Graph): either
2 Error information
2.1 Problem description
After reinstalling the Ascend + MindSpore 2.1 environment on a new machine, training with the parallel setting dp:mp:pp = 1:1:2 fails with the error shown in 2.2 below (any setting with pp > 1 triggers the same error). The parallel setting in question is sketched right after this paragraph.
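For context, dp:mp:pp = 1:1:2 corresponds to a parallel configuration along the following lines. This is a minimal sketch using mindformers' TransformerOpParallelConfig; in an actual run the same values are usually set through the model's YAML config, and parameter names may differ slightly between versions.

from mindformers.modules.transformer import TransformerOpParallelConfig

# dp:mp:pp = 1:1:2 -> data_parallel=1, model_parallel=1, pipeline_stage=2
parallel_config = TransformerOpParallelConfig(
    data_parallel=1,
    model_parallel=1,
    pipeline_stage=2,
    micro_batch_num=2,  # assumption: micro_batch_num should be >= pipeline_stage
)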
2.2 Error message
Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 149, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 81, in main
    task.train(train_checkpoint=ckpt, resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/trainer.py", line 424, in train
    is_full_config=True, **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/base_trainer.py", line 631, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 620, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 703, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 637, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 961, in compile_and_run
    self.compile(*args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 939, in compile
    jit_config_dict=self._jit_config_dict, *compile_args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/common/api.py", line 1623, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
3 Root cause analysis
The cell-reuse environment variable is most likely enabled, which is what triggers this pipeline-parallel error. You can confirm whether the variable is set with the check below.
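A minimal check (plain Python, no framework dependency); note that the meaning of an unset variable depends on the framework default of your MindSpore version.

import os

# "0" disables cell reuse; "1" (or the framework default when unset) enables it.
print("MS_DEV_CELL_REUSE =", os.environ.get("MS_DEV_CELL_REUSE", "<not set>"))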
4 Solution
There are two ways to fix this:
- Disable the cell-reuse environment variable with export MS_DEV_CELL_REUSE=0. (Cell reuse is an optimization that improves compilation performance.) A Python-side variant of this is sketched at the end of this section.
- If you keep cell reuse enabled, add the corresponding decorator to the model's constructor, as shown below.
# Excerpt from the WizardCoder model definition: decorate the registered model's
# __init__ with @cell_reuse. (The MindFormerRegister/BaseModel imports are added here
# for completeness; WizardCoderConfig is defined in the local wizardcoder package.)
from mindformers.tools.register import MindFormerRegister, MindFormerModuleType
from mindformers.models.base_model import BaseModel
from mindformers.models.utils import cell_reuse
from mindformers.modules.transformer.moe import default_moe_config
from mindformers.modules.layers import LayerNorm, Dropout
from mindformers.core.loss import CrossEntropyLoss


@MindFormerRegister.register(MindFormerModuleType.MODELS)
class WizardCoderLMHeadModel(BaseModel):
    r"""..."""
    ...

    @cell_reuse()
    def __init__(self, config: WizardCoderConfig = None):
        config = config if config is not None else WizardCoderConfig()
        super(WizardCoderLMHeadModel, self).__init__(config, auto_prefix=True)
Note: in MindSpore 2.2.0 the usage of the cell_reuse decorator changed; the trailing parentheses are no longer needed:
@MindFormerRegister.register(MindFormerModuleType.MODELS)
class WizardCoderLMHeadModel(BaseModel):
    r"""..."""
    ...

    @cell_reuse
    def __init__(self, config: WizardCoderConfig = None):
        config = config if config is not None else WizardCoderConfig()
        super(WizardCoderLMHeadModel, self).__init__(config, auto_prefix=True)
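For method 1, the environment variable can also be set from the launch script itself instead of the shell. A minimal sketch (hypothetical launcher snippet); exporting the variable in the shell before launch is equally valid and guarantees it precedes any framework initialization:

# Hypothetical launcher snippet: set the variable before importing
# mindspore / mindformers so it takes effect when graphs are compiled.
import os

os.environ["MS_DEV_CELL_REUSE"] = "0"  # method 1: disable cell reuse

import mindspore  # noqa: E402  (imported only after the variable is set)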