Parallel strategy 8:1:1 raises RuntimeError: May you need to check if the batch size etc. in your 'net' and 'parameter dict' are same.

Skyti · October 4, 2025, 17:35

1. System environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2.0
Execution mode (PyNative/Graph): any

2. Error information

2.1 Problem description

The error occurs when only data parallelism is enabled.

2.2 Error message

RuntimeError: For 'load_param_into_net', backbone.blocks.0.attention.projection.weight in the argument 'net' should have the same shape as backbone.blocks.0.attention.projection.weight in the argument 'parameter_dict'. But got its shape (768, 6144) in the argument 'net' and shape (6144, 6144) in the argument 'parameter_dict'. May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.

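3. Root cause analysis

Analysis shows that model sharding was also enabled automatically under data parallelism, because the configuration parameter optimizer_shard was set to True. Setting it to False turns off automatic model sharding in data-parallel mode.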
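The shapes in the error message fit this explanation: 6144 / 8 = 768, so the weight in 'net' looks like one of eight equal slices of the checkpoint weight along the first dimension. Below is a minimal sketch of that check in plain Python. The helper name `infer_shard_factors` is hypothetical and is not a MindSpore API; it only tests whether one shape is an even slice of another.

```python
from typing import Optional, Tuple


def infer_shard_factors(
    net_shape: Tuple[int, ...], ckpt_shape: Tuple[int, ...]
) -> Optional[Tuple[int, ...]]:
    """Return per-axis split factors if net_shape is an even slice of ckpt_shape.

    Returns None when the shapes cannot be explained by even slicing.
    Hypothetical helper for illustration only; not part of MindSpore.
    """
    if len(net_shape) != len(ckpt_shape):
        return None
    factors = []
    for net_dim, ckpt_dim in zip(net_shape, ckpt_shape):
        if net_dim <= 0 or ckpt_dim % net_dim != 0:
            return None
        factors.append(ckpt_dim // net_dim)
    return tuple(factors)


# Shapes taken from the error message above.
factors = infer_shard_factors((768, 6144), (6144, 6144))
print(factors)  # (8, 1): axis 0 is split 8 ways, matching the 8 in 8:1:1
```

A factor other than 1 on any axis suggests the network was built with sliced parameters while the checkpoint holds the full weight.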

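4. Solution

If you want to use optimizer_shard, the model must first be sharded, even in data-parallel mode.
Important: after setting the parameters above, loss_scale behaved abnormally. Normally, on overflow, loss_scale stops decreasing once it reaches its minimum of 1, but here it was updated to unavailable.

For data parallelism only, set parallel_mode to 0 (data-parallel mode). In this mode, setting optimizer_shard to True does not affect usage. full_batch must be set to False, and setting gradients_mean to True is recommended.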
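The data-parallel-only settings described above can be written down as a small sketch. The key names follow the parameters named in this post. The flat dictionary layout and the `check_data_parallel_only` helper are assumptions for illustration, not the actual configuration schema or a MindSpore API.

```python
from typing import Dict, List

# Settings for the data-parallel-only case, as described in this post.
# The flat dict layout is an assumption; adapt it to your own config file.
data_parallel_config: Dict[str, object] = {
    "parallel_mode": 0,        # 0 = data-parallel mode
    "optimizer_shard": True,   # per the post, does not affect usage in this mode
    "full_batch": False,       # must be False
    "gradients_mean": True,    # recommended
}


def check_data_parallel_only(cfg: Dict[str, object]) -> List[str]:
    """Return human-readable problems found in a data-parallel-only config."""
    problems = []
    if cfg.get("parallel_mode") != 0:
        problems.append("parallel_mode should be 0 (data parallel)")
    if cfg.get("full_batch") is not False:
        problems.append("full_batch must be False")
    if cfg.get("gradients_mean") is not True:
        problems.append("gradients_mean is recommended to be True")
    return problems


print(check_data_parallel_only(data_parallel_config))  # []
```

Running a check like this before launching a job may catch a leftover full_batch=True from a model-parallel configuration.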
