RuntimeError "Stage num is 8 is not equal to stage used: 5" when the model parallel strategy is 1:1:8

huan666 · September 27, 2025, 22:21

1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2.0
Execution mode (PyNative/Graph): Any

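2 Error Information

2.1 Problem Description

The model runs a 4-layer network with the parallel strategy set to dp:mp:pp = 1:1:8 (data parallel : model parallel : pipeline parallel), and compilation fails with the error below.

2.2 Error Message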

Traceback (most recent call last):  
  File "wizardcoder/run_wizardcoder.py", line 148, in <module>  
    device_id=args.device_id)  
  File "wizardcoder/run_wizardcoder.py", line 90, in main  
    task.finetune(finetune_checkpoint=config.load_checkpoint, auto_trans_ckpt=config.auto_trans_ckpt, resume=resume)  
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/trainer.py", line 522, in finetune  
    is_full_config=True, **kwargs  
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 106, in train  
    **kwargs)  
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/base_trainer.py", line 616, in training_process  
    transform_and_load_checkpoint(config, model, network, dataset)  
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/utils.py", line 300, in transform_and_load_checkpoint  
    build_model(config, model, dataset, do_eval=do_eval, do_predict=do_predict)  
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/utils.py", line 330, in build_model  
    sink_size=config.runner_config.sink_size)  
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 1263, in build  
    self._init(train_dataset, valid_dataset, sink_size, epoch)  
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 524, in _init  
    train_network.compile(*inputs)  
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/nn/cell.py", line 939, in compile  
    jit_config_dict=self._jit_config_dict, *compile_args, **kwargs)  
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/common/api.py", line 1623, in compile  
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())  
RuntimeError: Stage num is 8 is not equal to stage used: 5

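3 Root Cause Analysis

The model has only 4 layers, so it cannot be split into 8 pipeline stages. With fewer layers than stages, some stages receive no layer at all, and the number of stages actually used no longer matches the configured stage count.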
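The mismatch can be illustrated with a small sketch. The placement rule below is an assumption made for illustration (embedding on stage 0, layers spread in order, output head on the last stage); the rule actually used by your MindFormers version may differ, and `used_stages` is a hypothetical helper, not a library function. Under this assumption, 4 layers on 8 stages occupy 5 distinct stages, which is consistent with the "stage used: 5" in the error message.

```python
from typing import List


def used_stages(num_layers: int, pipeline_stage: int) -> List[int]:
    """Return the sorted stage ids that receive some part of the model.

    Assumed placement (hypothetical, for illustration only):
      * the embedding sits on stage 0,
      * layer i sits on stage min(i // layers_per_stage, pipeline_stage - 1),
      * the output head sits on the last stage.
    """
    layers_per_stage = max((num_layers + 1) // pipeline_stage, 1)
    stages = {0, pipeline_stage - 1}  # embedding and output head
    for layer_id in range(num_layers):
        stages.add(min(layer_id // layers_per_stage, pipeline_stage - 1))
    return sorted(stages)


print(used_stages(4, 8))  # 4 layers, 8 stages: stages 4, 5 and 6 stay empty
print(used_stages(8, 8))  # 8 layers, 8 stages: every stage is used
```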

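4 Solution

Make sure that pipeline_stage is less than or equal to num_layers. For the 4-layer model in this case, pipeline_stage must be at most 4, so either reduce pipeline_stage or increase the number of layers.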
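A check like the following could be run before launching training to catch this misconfiguration early. It is a minimal sketch: `check_pipeline_stage` and `valid_pipeline_stages` are hypothetical helpers written for this post, and the second one additionally assumes that the pipeline stage count should evenly divide the number of devices.

```python
from typing import List


def check_pipeline_stage(num_layers: int, pipeline_stage: int) -> None:
    """Raise ValueError if the model cannot fill every pipeline stage."""
    if pipeline_stage < 1:
        raise ValueError("pipeline_stage must be at least 1")
    if pipeline_stage > num_layers:
        raise ValueError(
            f"pipeline_stage ({pipeline_stage}) must be <= num_layers ({num_layers})"
        )


def valid_pipeline_stages(num_layers: int, device_num: int) -> List[int]:
    """List stage counts that divide device_num and do not exceed num_layers.

    Assumption: the stage count should evenly divide the device count.
    """
    return [
        pp
        for pp in range(1, device_num + 1)
        if device_num % pp == 0 and pp <= num_layers
    ]


# The configuration from this post: 4 layers, pipeline_stage = 8.
try:
    check_pipeline_stage(num_layers=4, pipeline_stage=8)
except ValueError as err:
    print(err)

# Stage counts that would satisfy the condition on 8 devices.
print(valid_pipeline_stages(num_layers=4, device_num=8))
```

With 4 layers on 8 devices, the remaining devices would have to be assigned to data or model parallelism instead, for example dp:mp:pp = 2:1:4.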
