1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend (the stack trace below comes from the Ascend backend)
MindSpore version: 2.0.0
Execution mode (PyNative/Graph): either
Python version: 3.9
Operating system platform: any
2 Error Message
After configuring distributed training, running the script produces the following error:
RuntimeError: Preprocess failed before run graph 1.
----------------------------------------------------
- Framework Error Message:
----------------------------------------------------
Out of Memory!!! Request memory size: 13518844928B, Memory Statistic:
Device HBM memory size: 32768M
MindSpore Used memory size: 30720M
MindSpore memory base address: 0x124180000000
Total Static Memory size: 19792M
Total Dynamic memory size: 0M
Dynamic memory size of this graph: 0M
Please try to reduce 'batch_size' or check whether exists extra large shape. For more details, please refer to 'Out of Memory' at https://www.mindspore.cn .
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244 PreprocessBeforeRunGraph
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:169 MallocDynamicDevMem
3 Root Cause Analysis
The code itself was not executed; the error message already points to an OOM condition: the memory requested for the graph exceeds what is actually free on the device, as the quick check below confirms.
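For reference, plugging the figures from the log into a short calculation shows the shortfall (the values are copied verbatim from the error message and converted to MB):

request_mb = 13518844928 / 1024**2  # requested dynamic memory, ~12893 MB
pool_mb = 30720                     # "MindSpore Used memory size"
static_mb = 19792                   # "Total Static Memory size"
free_mb = pool_mb - static_mb       # ~10928 MB left in MindSpore's pool
print(request_mb > free_mb)         # True: the request cannot be satisfied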
4 Solution
Reduce the amount of memory requested. You can add the following code before the distributed configuration:
from mindspore import context

# Cap the device memory MindSpore may reserve; "25GB" leaves headroom on the 32 GB HBM.
context.set_context(max_device_memory="25GB")
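For placement, here is a minimal sketch of a typical Ascend distributed entry point; the use of mindspore.communication.init() for the distributed setup is an assumption, not taken from the original script:

import mindspore as ms
from mindspore.communication import init

# The cap must be set BEFORE the distributed setup, so it takes effect
# when the device memory pool is created.
ms.set_context(max_device_memory="25GB")
ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
init()  # assumed HCCL initialization for multi-device training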