MindSpore model fails with "Reason: Memory resources are exhausted."

1. System environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=xxx
Execution mode (PyNative/Graph): xxx
Python version: Python=3.9.13
Operating system platform: Linux

2. Error message

2.1 Problem description

File "/home/ma-user/modelarts/user-job-dir/train.py", line 309, in run_train_no_pipeline  
    model.train(actual_epoch_num, ds, callbacks=callback, sink_size=args_opt.sink_size, dataset_sink_mode=True)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 1065, in train  
    initial_epoch=initial_epoch)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper  
    func(self, *args, **kwargs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 619, in _train  
    cb_params, sink_size, initial_epoch, valid_infos)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 702, in _train_dataset_sink_process  
    outputs = train_network(*inputs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 641, in __call__  
    out = self.compile_and_run(*args, **kwargs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 964, in compile_and_run  
    return _cell_graph_executor(self, *new_args, phase=self.phase)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1548, in __call__  
    return self.run(obj, *args, phase=phase)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1587, in run  
    return self._exec_pip(obj, *args, phase=phase_real)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 110, in wrapper  
    results = fn(*arg, **kwargs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1567, in _exec_pip  
    return self._graph_executor(args, phase)  
RuntimeError: Exec graph failed  
----------------------------------------------------  
- Ascend Error Message:  
----------------------------------------------------  
EI0007: Failed to allocate resource[qp] with info [rdmaHandle:187651097805232flag:0qpMode:3281438645626752]. Reason: Memory resources are exhausted.  
        Possible Cause: Failed to allocate memory or the Notify register due to resource insufficiency.  
        TraceBack (most recent call last):  
        Call ops_kernel_info_store LoadTask fail[FUNC:Distribute][FILE:hccl_task_info.cc][LINE:320]  
        Call ops_kernel_info_store unloadTask fail[FUNC:Release][FILE:hccl_task_info.cc][LINE:657]  
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)  
----------------------------------------------------  
- C++ Call Stack: (For framework developers)  
----------------------------------------------------  
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:997 RunGraph

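3. Root cause analysis

Memory allocation failed. Specifically, HCCL did not have enough memory available, so the resource allocation reported under EI0007 failed.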

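4. Solution

In this case max_device_memory was set too high, which left too little memory for the system to allocate to HCCL. Try reducing the value of the max_device_memory parameter.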

context.set_context(max_device_memory=max_device_memory)
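A minimal sketch of one way to choose a smaller value is shown below. The helper names `reduced_device_memory` and `apply_device_memory_limit`, and the figures 32 GB and 4 GB, are assumptions for illustration only; they do not come from the original report. The right amount to leave for HCCL depends on the device and the job, so treat the numbers as a starting point to tune. The only MindSpore call used is the same `context.set_context(max_device_memory=...)` as above, which takes a string in the "xxGB" format.

```python
def reduced_device_memory(total_gb: float, reserve_gb: float) -> str:
    """Build a max_device_memory string that leaves reserve_gb outside the framework's pool.

    Both arguments are hypothetical inputs chosen by the user:
    total_gb is the device memory size, reserve_gb is what to leave for HCCL.
    """
    if reserve_gb <= 0 or reserve_gb >= total_gb:
        raise ValueError("reserve_gb must be positive and smaller than total_gb")
    # MindSpore expects the "xxGB" format, for example "28GB" or "29.5GB".
    return f"{total_gb - reserve_gb:g}GB"


def apply_device_memory_limit(value: str) -> None:
    """Apply the limit; call this before the network is built and trained."""
    # Imported here so the helper above can be used without MindSpore installed.
    from mindspore import context

    context.set_context(max_device_memory=value)


# Example usage (illustrative numbers, not a recommendation):
# apply_device_memory_limit(reduced_device_memory(32, 4))
```

If the error persists, lower the value further in small steps and rerun, since the amount HCCL needs is not stated in the error message.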