1. System Environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=xxx
Execution mode (PyNative/Graph): xxx
Python version: Python=3.9.13
Operating system: Linux
2. Error Message
2.1 Problem Description
File "/home/ma-user/modelarts/user-job-dir/train.py", line 309, in run_train_no_pipeline
model.train(actual_epoch_num, ds, callbacks=callback, sink_size=args_opt.sink_size, dataset_sink_mode=True)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 1065, in train
initial_epoch=initial_epoch)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
func(self, *args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 619, in _train
cb_params, sink_size, initial_epoch, valid_infos)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 702, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 641, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 964, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1548, in __call__
return self.run(obj, *args, phase=phase)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1587, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 110, in wrapper
results = fn(*arg, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1567, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Exec graph failed
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EI0007: Failed to allocate resource[qp] with info [rdmaHandle:187651097805232flag:0qpMode:3281438645626752]. Reason: Memory resources are exhausted.
Possible Cause: Failed to allocate memory or the Notify register due to resource insufficiency.
TraceBack (most recent call last):
Call ops_kernel_info_store LoadTask fail[FUNC:Distribute][FILE:hccl_task_info.cc][LINE:320]
Call ops_kernel_info_store unloadTask fail[FUNC:Release][FILE:hccl_task_info.cc][LINE:657]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:997 RunGraph
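3. Root Cause Analysis
Memory allocation failed. Specifically, the memory available to HCCL was insufficient, so graph execution failed.
4. Solution
Here max_device_memory is set too large, which leaves too little memory for the system to allocate to HCCL. Try reducing the value of the max_device_memory parameter.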
from mindspore import context
context.set_context(max_device_memory=max_device_memory)  # a string in "xxGB" format, e.g. "25GB"
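As a rough illustration of lowering max_device_memory, the sketch below computes a value that leaves some device memory free for HCCL. The helper name format_max_device_memory and the figures (a 32 GB device, 4 GB reserved) are assumptions made for this example and do not come from the original report; the right reserve depends on your hardware and job and may need some trial and error.

```python
def format_max_device_memory(total_gb: int, hccl_reserve_gb: int) -> str:
    """Return an "xxGB" string that leaves hccl_reserve_gb of device memory unclaimed.

    Hypothetical helper for illustration; it is not part of MindSpore.
    """
    if hccl_reserve_gb < 0 or hccl_reserve_gb >= total_gb:
        raise ValueError("hccl_reserve_gb must be in the range [0, total_gb)")
    return f"{total_gb - hccl_reserve_gb}GB"


# Assumed numbers: a 32 GB device with 4 GB left outside MindSpore's pool.
max_device_memory = format_max_device_memory(32, 4)

try:
    from mindspore import context

    # Must be called before the network is compiled and run.
    context.set_context(max_device_memory=max_device_memory)
except ImportError:
    # MindSpore is not installed here; only the computed string is available.
    pass
```

If the EI0007 error persists, lowering the value further in small steps is a reasonable next attempt.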