MindSpore data-parallel training fails with "Call GE RunGraphWithStreamAsync Failed, EL0004: Failed to allocate memory."

1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.3.0
Execution mode (PyNative/Graph): GRAPH
Python version: Python=3.9.13
Operating system platform: linux

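2 Error Message

2.1 Problem Description

Training a 7B-parameter model with data parallelism on a single machine with 8 cards fails with the following error: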

[ERROR] GE_ADPT(2032,fffcc57fd1e0,python):2024-01-26-18:16:34.140.365 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:352] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295  
2024-01-26 18:16:34,262 - mindformers[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):  
  File "/home/ma-user/modelarts/user-job-dir/mindformers/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper  
    result = run_func(*args, **kwargs)  
  File "/home/ma-user/modelarts/user-job-dir/mindformers/run_mindformer.py", line 130, in main  
    trainer.train(config, is_full_config=True)  
  File "/home/ma-user/modelarts/user-job-dir/mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train  
    **kwargs)  
  File "/home/ma-user/modelarts/user-job-dir/mindformers/mindformers/trainer/base_trainer.py", line 738, in training_process  
    initial_epoch=config.runner_config.initial_epoch)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 1073, in train  
    initial_epoch=initial_epoch)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper  
    func(self, *args, **kwargs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 624, in _train  
    cb_params, sink_size, initial_epoch, valid_infos)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process  
    outputs = train_network(*inputs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 680, in __call__  
    out = self.compile_and_run(*args, **kwargs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 1023, in compile_and_run  
    return _cell_graph_executor(self, *new_args, phase=self.phase)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1589, in __call__  
    return self.run(obj, *args, phase=phase)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1628, in run  
    return self._exec_pip(obj, *args, phase=phase_real)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 121, in wrapper  
    results = fn(*arg, **kwargs)  
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1608, in _exec_pip  
    return self._graph_executor(args, phase)  
RuntimeError: Exec graph failed  
----------------------------------------------------  
- Ascend Error Message:  
----------------------------------------------------  
EL0004: Failed to allocate memory.  
        Possible Cause: Available memory is insufficient.  
        Solution: Close applications not in use.  
        TraceBack (most recent call last):  
        Transport init error. Reason: [Create][DestLink]Create Dest error! creakLink para:rank[7]-localUserrank[7]-localIpAddr[192.168.66.9], dst_rank[3]-remoteUserrank[3]-remote_ip_addr[192.168.66.9]

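The message above points to "Failed to allocate memory." Most users would assume that the batch size (bs) is too large and device memory has run out, but the same error still occurs when bs is set to 1.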
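As a rough triage aid, the sketch below scans log text for the two markers that appear together in the log above: the EL0004 code and the "Transport init error" / "[Create][DestLink]" line. The function name `classify_alloc_failure` and its return labels are hypothetical and are not part of MindSpore or CANN; the heuristic only reflects this one log and may not cover other failure patterns.

```python
import re


def classify_alloc_failure(log_text: str) -> str:
    """Hypothetical helper: label an EL0004 failure by the lines around it."""
    if "EL0004" not in log_text:
        return "not-el0004"
    # In the log above, the allocation failure is accompanied by a
    # communication link setup error rather than an operator-level error.
    if re.search(r"Transport init error|\[Create\]\[DestLink\]", log_text):
        return "link-setup"
    return "other-allocation"


sample = (
    "EL0004: Failed to allocate memory.\n"
    "        Transport init error. Reason: [Create][DestLink]Create Dest error!"
)
print(classify_alloc_failure(sample))
```

If the result is "link-setup", shrinking the batch size is unlikely to help on its own, which matches the behaviour described above.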

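3 Root Cause Analysis

If device memory is still reported as insufficient with bs at its minimum, the batch size alone is unlikely to be the cause. The last line of the Ascend traceback, "Transport init error ... Create Dest error", shows the failure occurring while a link between rank 7 and rank 3 is being created, so we should consider whether the job crashes because too little device memory is left for HCCL. The max_device_memory parameter caps the device memory that MindSpore reserves for itself, so lowering it leaves more for HCCL. After reducing max_device_memory, the out-of-memory problem was resolved.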

4 Solution

Reduce the value of the max_device_memory parameter so that more device memory is left for HCCL.
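A minimal sketch of applying this through `mindspore.set_context`, which accepts `max_device_memory` as a string such as "26GB". The helper `max_device_memory_value`, the 32 GB card size, and the 6 GB reserve are assumptions chosen for illustration only; the original post does not state which value was used, so the right number has to be found for your own hardware and model.

```python
def max_device_memory_value(total_hbm_gb: float, reserve_gb: float) -> str:
    """Hypothetical helper: build a max_device_memory string that leaves
    reserve_gb of device memory outside MindSpore's own allocation."""
    if not 0 < reserve_gb < total_hbm_gb:
        raise ValueError("reserve_gb must be between 0 and total_hbm_gb")
    return f"{total_hbm_gb - reserve_gb:g}GB"


# Assumed numbers: a 32 GB card with 6 GB kept free for HCCL and other users.
value = max_device_memory_value(32, 6)
print(value)

try:
    import mindspore as ms
except ImportError:
    ms = None  # MindSpore not installed; the calculation above still runs.

if ms is not None:
    # Set this early in the training script, before the network is executed.
    ms.set_context(max_device_memory=value)
```

When training is launched through MindFormers, the same setting is typically exposed as `max_device_memory` under the `context` section of the YAML configuration file. If the error persists, lower the value further in small steps.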