1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend/GPU/CPU
MindSpore version: mindspore=2.0.0
Execution mode (PyNative/Graph): either
Python version: Python=3.9
OS platform: any
2 Error Message
After configuring distributed training, running the script fails with the following error:
RuntimeError: Preprocess failed before run graph 1.
----------------------------------------------------
- Framework Error Message:
----------------------------------------------------
Out of Memory!!! Request memory size: 13518844928B, Memory Statistic:
Device HBM memory size: 32768M
MindSpore Used memory size: 30720M
MindSpore memory base address: 0x124180000000
Total Static Memory size: 19792M
Total Dynamic memory size: 0M
Dynamic memory size of this graph: 0M
Please try to reduce 'batch_size' or check whether exists extra large shape. For more details, please refer to 'Out of Memory' at https://www.mindspore.cn .
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_kernel_executor.cc:244 PreprocessBeforeRunGraph
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:169 MallocDynamicDevMem
3 Root Cause Analysis
Please try to reduce 'batch_size' or check whether exists extra large shape. For more details, please refer to 'Out of Memory' at https://www.mindspore.cn .
As the error message suggests, the failure comes from the memory limit of the local hardware environment: reduce 'batch_size' so that each step processes less data and the graph requests less device memory.
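As a minimal sketch of where this change usually goes (the dataset type, path, and the original value of 32 are assumptions for illustration, not taken from the failing script), the batch size is normally set when the data pipeline is built, so reducing it there directly reduces the memory each step requests:

```python
# Minimal sketch: where batch_size is typically set in a MindSpore data pipeline.
# The dataset type/path and the original value of 32 are assumptions.
import mindspore.dataset as ds

batch_size = 16  # reduced from an assumed original of 32 to cut per-step memory

dataset = ds.Cifar10Dataset("./cifar-10-batches-bin", usage="train", shuffle=True)
dataset = dataset.batch(batch_size, drop_remainder=True)
```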
4 Solution
- Before launching the job, run npu-smi info to check whether any of the cards used for distributed training are occupied by other processes, and make sure every card is free before starting.
- Lower the batch size to 1 and check whether the job can run. If the batch size in the code is the global batch size, then with a global batch size of 8 and device_num=8 the per-device batch size is already 1 (see the sketch below); if it still fails, the machine's configuration is insufficient to run the current code.
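For reference, a minimal sketch of how the per-device batch size relates to the global batch size in a data-parallel run (the global value of 8 follows the example above; the script structure is an assumption):

```python
# Minimal sketch (assumed structure): derive the per-device batch size from a
# global batch size under data parallelism, after initializing communication.
from mindspore.communication import init, get_group_size

init()                         # initialize the distributed communication backend
device_num = get_group_size()  # e.g. 8 cards in the example above

global_batch_size = 8
per_device_batch_size = global_batch_size // device_num  # 8 // 8 = 1
assert per_device_batch_size >= 1, "global batch size cannot be split further"
```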