1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend (the stack trace below comes from the Ascend backend)
MindSpore version: 2.0.0
Execution mode (PyNative/Graph): either
Python version: 3.9
Operating system platform: any
2 Error Message
After configuring distributed training, running the script produces the following error:
RuntimeError: Preprocess failed before run graph 1.
----------------------------------------------------
- Framework Error Message:
----------------------------------------------------
Out of Memory!!! Request memory size: 13518844928B, Memory Statistic:
Device HBM memory size: 32768M
MindSpore Used memory size: 30720M
MindSpore memory base address: 0x124180000000
Total Static Memory size: 19792M
Total Dynamic memory size: 0M
Dynamic memory size of this graph: 0M
Please try to reduce 'batch_size' or check whether exists extra large shape. For more details, please refer to 'Out of Memory' at https://www.mindspore.cn .
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244 PreprocessBeforeRunGraph
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:169 MallocDynamicDevMem
3 Root Cause Analysis
The code itself was not executed; the error message already points to an OOM condition: the memory requested for the graph exceeds what is actually free on the device, as the quick check below confirms.
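For reference, plugging the figures from the log into a short calculation shows the shortfall (the values are copied verbatim from the error message and converted to MB):

request_mb = 13518844928 / 1024**2  # requested dynamic memory, ~12893 MB
pool_mb = 30720                     # "MindSpore Used memory size"
static_mb = 19792                   # "Total Static Memory size"
free_mb = pool_mb - static_mb       # ~10928 MB left in MindSpore's pool
print(request_mb > free_mb)         # True: the request cannot be satisfied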
4 Solution
Reduce the amount of memory requested. You can add the following code before the distributed configuration:
from mindspore import context

# Cap the device memory MindSpore may reserve; "25GB" leaves headroom on the 32 GB HBM.
context.set_context(max_device_memory="25GB")
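For placement, here is a minimal sketch of a typical Ascend distributed entry point; the use of mindspore.communication.init() for the distributed setup is an assumption, not taken from the original script:

import mindspore as ms
from mindspore.communication import init

# The cap must be set BEFORE the distributed setup, so it takes effect
# when the device memory pool is created.
ms.set_context(max_device_memory="25GB")
ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
init()  # assumed HCCL initialization for multi-device training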