MindSpore Pipeline-Parallel Training Fails with RuntimeError: Stage 0 should has at least 1 parameter. but got none.

1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.1
Execution mode (PyNative/Graph): either

2 Error Information

2.1 Problem Description

After reinstalling the Ascend + MindSpore 2.1 environment on a new machine, training with the parallel configuration dp:mp:pp = 1:1:2 fails with the error below. Any configuration with pp > 1 triggers the same error.

2.2 Error Message

Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 149, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 81, in main
    task.train(train_checkpoint=ckpt, resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/trainer.py", line 424, in train
    is_full_config=True, **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/base_trainer.py", line 631, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 620, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/train/model.py", line 703, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 637, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 961, in compile_and_run
    self.compile(*args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/nn/cell.py", line 939, in compile
    jit_config_dict=self.jit_config_dict, *compile_args, **kwargs)
  File "/root/miniconda3/envs/zxw/lib/python3.7/site-packages/mindspore/common/api.py", line 1623, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())

3 Root Cause Analysis

The cell-reuse environment variable is most likely enabled in this environment, which is what triggers this pipeline error: with cell reuse on, the compiler cannot assign any parameters to pipeline stage 0 unless the model's constructor is decorated accordingly.
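Before launching a pp > 1 job, it can help to confirm whether the switch is set. A minimal sketch, assuming the `MS_DEV_CELL_REUSE` variable named in this article and treating an unset variable as "off" (the helper name `cell_reuse_enabled` is our own, not a MindSpore API):

```python
import os

def cell_reuse_enabled() -> bool:
    """Return True when MS_DEV_CELL_REUSE is set to "1" (unset is treated as off here)."""
    return os.environ.get("MS_DEV_CELL_REUSE", "0") == "1"

if cell_reuse_enabled():
    print("Cell reuse is ON: add the @cell_reuse decorator or export MS_DEV_CELL_REUSE=0.")
else:
    print("Cell reuse is OFF: pipeline parallel (pp > 1) should compile normally.")
```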

4 Solution

There are two fixes:

  1. Disable the cell-reuse environment variable with the command export MS_DEV_CELL_REUSE=0. (Cell reuse is an optimization that speeds up graph compilation, so disabling it may slow compilation.)
  2. Keep cell reuse enabled and add the corresponding decorator to the model's constructor:
from mindformers.tools.register import MindFormerRegister, MindFormerModuleType
from mindformers.models.utils import cell_reuse
from mindformers.modules.transformer.moe import default_moe_config
from mindformers.modules.layers import LayerNorm, Dropout
from mindformers.core.loss import CrossEntropyLoss

@MindFormerRegister.register(MindFormerModuleType.MODELS)
class WizardCoderLMHeadModel(BaseModel):
    r"""..."""
    ...
    @cell_reuse()
    def __init__(self, config: WizardCoderConfig = None):
        config = config if config is not None else WizardCoderConfig()
        super(WizardCoderLMHeadModel, self).__init__(config, auto_prefix=True)

[Note]: In MindSpore 2.2.0 the cell_reuse decorator is written differently: the trailing parentheses are no longer needed.

@MindFormerRegister.register(MindFormerModuleType.MODELS)
class WizardCoderLMHeadModel(BaseModel):
    r"""..."""
    ...
    @cell_reuse
    def __init__(self, config: WizardCoderConfig = None):
        config = config if config is not None else WizardCoderConfig()
        super(WizardCoderLMHeadModel, self).__init__(config, auto_prefix=True)
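The version difference above comes down to whether cell_reuse is a decorator factory (called with parentheses, as in MindSpore 2.1) or a plain decorator (applied bare, as in 2.2.0). A toy illustration of the two styles; `traced_factory` and `traced` are stand-ins invented for this sketch, not the real mindformers implementation:

```python
import functools

# Style 1 (2.1-era): a decorator factory, used as @traced_factory()
def traced_factory():
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # the real decorator would mark the cell for graph reuse here
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Style 2 (2.2.0-era): a plain decorator, used as @traced
def traced(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@traced_factory()  # note the parentheses: this CALL returns the decorator
def init_v21():
    return "built with @traced_factory()"

@traced  # no parentheses: the function itself is the decorator
def init_v22():
    return "built with @traced"
```

Mixing the styles up (e.g. writing `@traced_factory` without parentheses) silently wraps the constructor with the wrong object, which is why the 2.2.0 note above matters.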