1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.3.0
Execution mode (PyNative/Graph): GRAPH
Python version: Python=3.9.13
Operating system platform: Linux
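2 Error Message
2.1 Problem Description
Training a 7B-parameter model on a single machine with 8 cards, using data parallelism, fails with the following error: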
[ERROR] GE_ADPT(2032,fffcc57fd1e0,python):2024-01-26-18:16:34.140.365 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:352] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
2024-01-26 18:16:34,262 - mindformers[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
File "/home/ma-user/modelarts/user-job-dir/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
result = run_func(*args, **kwargs)
File "/home/ma-user/modelarts/user-job-dir/mindformers/run_mindformer.py", line 130, in main
trainer.train(config, is_full_config=True)
File "/home/ma-user/modelarts/user-job-dir/mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
**kwargs)
File "/home/ma-user/modelarts/user-job-dir/mindformers/mindformers/trainer/base_trainer.py", line 738, in training_process
initial_epoch=config.runner_config.initial_epoch)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 1073, in train
initial_epoch=initial_epoch)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper
func(self, *args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 624, in _train
cb_params, sink_size, initial_epoch, valid_infos)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 680, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 1023, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1589, in __call__
return self.run(obj, *args, phase=phase)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1628, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 121, in wrapper
results = fn(*arg, **kwargs)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1608, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Exec graph failed
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EL0004: Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
Transport init error. Reason: [Create][DestLink]Create Dest error! creakLink para:rank[7]-localUserrank[7]-localIpAddr[192.168.66.9], dst_rank[3]-remoteUserrank[3]-remote_ip_addr[192.168.66.9]
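The message above reports "Failed to allocate memory." Most users would assume that the batch size (bs) is too large and device memory has run out, but the same error still occurs when bs is set to 1.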
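3 Root Cause Analysis
Device memory is still reported as insufficient with bs at its minimum, so the batch size is unlikely to be the cause. Instead, we should consider whether the job fails because too little device memory is left for HCCL. The "Transport init error" line in the log, raised while creating the link between rank 7 and rank 3, is consistent with this. Reducing the max_device_memory parameter, which limits how much device memory MindSpore itself may use, resolved the out-of-memory error.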
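As a rough aid for telling this case apart from an ordinary batch-size OOM, the check below looks for the two markers that appear together in the log above: the EL0004 allocation error and the "Transport init error" line. It is only a heuristic sketch. The function name is hypothetical, and the presence of both strings is a hint, not proof, that HCCL is the component short of memory.

```python
def looks_like_hccl_memory_failure(log_text: str) -> bool:
    """Heuristic: True if the log has both the EL0004 allocation error
    and an HCCL transport/link initialization error.

    The two marker strings are copied from the log in this post; other
    CANN or MindSpore versions may word them differently (assumption).
    """
    has_alloc_error = "EL0004" in log_text
    has_link_error = "Transport init error" in log_text
    return has_alloc_error and has_link_error


# Example with two lines taken from the error output above.
sample = (
    "EL0004: Failed to allocate memory.\n"
    "Transport init error. Reason: [Create][DestLink]Create Dest error!\n"
)
print(looks_like_hccl_memory_failure(sample))
```

If the check returns False, for example when only EL0004 is present, lowering the batch size remains the more natural first step.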
4 Solution
Reduce the value of the max_device_memory parameter.
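A minimal sketch of how the value could be lowered from Python, assuming the limit is passed through mindspore.set_context(max_device_memory=...), which takes a string of the form "xxGB". The helper names and the figures "30GB" and 4 are hypothetical and only illustrate the arithmetic. The right amount to leave free depends on the device and the job and was not stated in the original post.

```python
import re


def lower_max_device_memory(current: str, reserve_gb: float) -> str:
    """Return a smaller "xxGB" string, leaving reserve_gb more device
    memory outside MindSpore's own limit (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*GB\s*", current)
    if match is None:
        raise ValueError(f"expected a value like '30GB', got {current!r}")
    new_value = float(match.group(1)) - reserve_gb
    if new_value <= 0:
        raise ValueError("reserve_gb is not smaller than the current limit")
    # ":g" drops a trailing ".0", so 26.0 is written as "26GB".
    return f"{new_value:g}GB"


def apply_limit(limit: str) -> None:
    """Pass the limit to MindSpore; needs an installed MindSpore."""
    import mindspore as ms  # imported here so the helper above works without it

    ms.set_context(max_device_memory=limit)


# Illustrative numbers only: start at 30GB and free up 4GB.
new_limit = lower_max_device_memory("30GB", 4)
print(new_limit)
```

In a real job, apply_limit(new_limit) would be called before the network is built. When training is launched through run_mindformer.py, the same value is normally set in the context section of the MindFormers YAML config rather than in code. If the error persists, the limit can be reduced further in small steps.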