Single-machine 4-card distributed inference fails with RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.

1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: any
Execution mode (PyNative/Graph): any

2 Error Message

README: https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek/deepseek.md
Distributed inference on a single machine with 4 cards fails with an out of memory error. The error message is as follows:

[WARNING] DISTRIBUTED(2517382,ffff92cf20b0,python):2024-05-08-15:44:00.654.709 [mindspore/ccsrc/distributed/rpc/tcp/tcp_comm.cc:464] Connect] Waiting for the state of the connection to 127.0.0.1:8118 to be connected...Retry number: 1
[WARNING] DISTRIBUTED(2517382,ffff92cf20b0,python):2024-05-08-15:44:01.659.911 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:194] BuildCluster] Topology build timed out., retry(1/200).
[WARNING] DISTRIBUTED(2517382,ffff92cf20b0,python):2024-05-08-15:44:04.660.169 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:196] BuildCluster] Cluster is successfully initialized.
[WARNING] DISTRIBUTED(2517382,ffff92cf20b0,python):2024-05-08-15:44:04.660.685 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:260] PostProcess] This node 0 rank id: 0
[WARNING] DEVICE(2517382,ffff92cf20b0,python):2024-05-08-15:44:04.850.464 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:95] Initialize] Reserved memory size for other components(1040187392) is less than recommend size(4073433600), It may lead to Out Of Memory in HCCL or other components, Please double check context key 'variable_memory_max_size'/'max_device_memory'
2024-05-08 15:44:44,925 - mindformers[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
  File "/home/jenkins0/csj/mindformers/mindformers/core/context/build_context.py", line 120, in init_context
    init()
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/communication/management.py", line 188, in init
    init_hccl()
RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.

----------------------------------------------------
- Framework Error Message:
----------------------------------------------------
Malloc device memory failed, size[64424509440], ret[207001], Device 0 Available HBM size:65464696832 free size:65165885440 may be other processes occupying this card, check as: ps -ef|grep python

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EL0004: 2024-05-08-15:44:08.292.766 Failed to allocate memory.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
        TraceBack (most recent call last):
        rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

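3 Solution

The memory actually used exceeds the available device memory preset for the process. Modify the yaml parameter max_device_memory to raise the preset available device memory.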
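
Before picking a new value for max_device_memory, it can help to cross-check the byte counts printed in the log. The sketch below only does arithmetic on numbers copied from the error message in section 2; the function name `summarize` and the dictionary keys are hypothetical, chosen for illustration, and nothing here calls MindSpore.

```python
GIB = 1024 ** 3  # bytes per GiB

# Values copied verbatim from the log in section 2.
requested = 64424509440             # size[...] in "Malloc device memory failed"
total_hbm = 65464696832             # "Available HBM size"
free_hbm = 65165885440              # "free size"
reserved_reported = 1040187392      # "Reserved memory size for other components"
reserved_recommended = 4073433600   # "recommend size" from the same warning


def summarize():
    """Return a few derived figures from the logged byte counts."""
    reserved = total_hbm - requested
    return {
        # Size of the single allocation the framework attempted, in GiB.
        "requested_gib": requested / GIB,
        # What is left of the HBM for HCCL and other components.
        "reserved": reserved,
        # How far the reserve falls below the size the warning recommends.
        "reserve_shortfall": reserved_recommended - reserved,
        # Free memory that would remain if the request had succeeded.
        "free_after_request": free_hbm - requested,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

The attempted allocation works out to exactly 60 GiB, and the difference between the total HBM size and that request equals the reserved size reported in the DEVICE warning. These figures give the bounds to work within when editing max_device_memory in the yaml file.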