MindSpore PanGu model reports "Failed to allocate memory. Possible Cause: Available memory is insufficient."

1. System environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2
Execution mode: static graph
Python version: 3.8
Operating system: Linux

2. Error information

2.1 Problem description

Running a PanGu model on MindSpore fails with "Failed to allocate memory. Possible Cause: Available memory is insufficient."

2.2 Error message

- Ascend Error Message:  
----------------------------------------------------  
EL0004: Failed to allocate memory.  
        Possible Cause: Available memory is insufficient.

or

The Free Device Memory Size is 28.2318 GB, variable_memory_max_size/max_device_memory should be in range (0-28909.4]MB, but got 30720MB, please set the context key 'variable_memory_max_size'/'max_device_memory' in valid range.
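
The arithmetic behind the second message can be checked with a small sketch: the requested 30720 MB (30 GB) exceeds the reported upper bound of 28909.4 MB. The helper name `suggest_max_device_memory` and the 2048 MB headroom reserved for HCCL and the driver are assumptions made for illustration, not values taken from MindSpore.

```python
import re


def suggest_max_device_memory(message, headroom_mb=2048.0):
    """Parse the range error and return (upper_mb, requested_mb, suggestion).

    headroom_mb is an assumed reserve for HCCL/driver, not an official value.
    Returns None when the message does not contain the expected range text.
    """
    match = re.search(r"range \(0-([\d.]+)\]MB, but got ([\d.]+)MB", message)
    if match is None:
        return None
    upper_mb = float(match.group(1))
    requested_mb = float(match.group(2))
    # Round down to whole gigabytes so the suggestion stays below the limit.
    suggested_gb = int((upper_mb - headroom_mb) // 1024)
    return upper_mb, requested_mb, f"{suggested_gb}GB"


message = (
    "The Free Device Memory Size is 28.2318 GB, "
    "variable_memory_max_size/max_device_memory should be in range "
    "(0-28909.4]MB, but got 30720MB"
)
print(suggest_max_device_memory(message))
```

With the message above and the assumed headroom, the suggestion comes out at 26GB, which is below the reported limit.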

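3. Root cause analysis

1. Ask the cloud platform to upgrade the driver to 23.0rc3 or later (remember to keep all the matching packages on compatible versions). This reduces the device memory occupied by HCCL.
2. In the main Python script, modify the context, or change the max_device_memory setting in the MindFormers yaml configuration file. Try lowering the device memory allocated to MindSpore so that more is left for HCCL and the driver.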

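from mindspore import context  
context.set_context(max_device_memory="28GB") # 910 device memory setting; on B4, try 26GB or similar  
context.set_context(max_device_memory="57GB") # 910 device memory setting; on B3, try 57GB or similar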

3. Add the following environment variable to the launch script

echo "--------------npu-smi info-----------"  
npu-smi info  
export MS_DEV_CELL_REUSE=1   # enable cell reuse

Alternatively, add only the following two environment variables. This was tried on pangu-sigma 38b and worked.

export MS_ENABLE_FORMAT_MODE=1  
export MS_MEMORY_STATISTIC=1
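
When the job is started from a Python entry point rather than a shell script, the same variables can be set in-process. The sketch below assumes that MindSpore reads these variables when it initializes, so they have to be set before `import mindspore`; exporting them in the launch script, as shown above, remains the method described in this post. The helper name `export_memory_env` is hypothetical.

```python
import os

# Variable names are taken from this post; the values mirror the shell exports.
MEMORY_ENV = {
    "MS_ENABLE_FORMAT_MODE": "1",
    "MS_MEMORY_STATISTIC": "1",
}


def export_memory_env(extra=None):
    """Write the memory-related variables into os.environ and return them."""
    settings = dict(MEMORY_ENV)
    if extra:
        settings.update(extra)
    os.environ.update(settings)
    return settings


# Optionally include the cell-reuse switch from step 3 as well.
applied = export_memory_env({"MS_DEV_CELL_REUSE": "1"})
print(applied)

# Assumption: import mindspore only after this point so the settings are seen.
```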

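If none of the methods above helps, keep lowering the batch size, the number of network layers, and max_device_memory. In this case, the operator that finally failed turned out to be short of device memory on the HCCL side: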

E19999: Inner Error!  
E19999  Call ops_kernel_info_store LoadTask fail[FUNC:Distribute][FILE:hccl_task_info.cc][LINE:320]  
            TraceBack (most recent call last):  
            Call ops_kernel_info_store unloadTask fail[FUNC:Release][FILE:hccl_task_info.cc][LINE:657]  
            GraphManager RunGraphWithStreamAsync failed, session id = 0, graph id = 286, stream = 0xaaaae871edd0.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:516]  
            [Run][Graph]Run graph with stream asyn failed, error code = -1343225860, session id = 0, graph id = 286, stream = 0xaaaae871edd0.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:774]  
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
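
The step-by-step reduction of batch size, layer count, and max_device_memory can be organized as a simple search. This is a minimal sketch, not a MindSpore or MindFormers API: `try_run` is a hypothetical callable that would launch a short trial job and return True when it finishes without an allocation error, and `fake_run` only stands in for it here so the loop can be demonstrated.

```python
def find_fitting_config(try_run, batch_sizes, layer_counts, memory_limits_gb):
    """Return the first (batch_size, num_layers, max_device_memory) that runs.

    Candidates should be listed from largest to smallest, so the first
    success is the least reduced configuration in this search order.
    """
    for memory_gb in memory_limits_gb:
        setting = f"{memory_gb}GB"
        for num_layers in layer_counts:
            for batch_size in batch_sizes:
                if try_run(batch_size, num_layers, setting):
                    return batch_size, num_layers, setting
    return None


def fake_run(batch_size, num_layers, max_device_memory):
    """Stand-in for a real trial run; the thresholds here are made up."""
    memory_gb = int(max_device_memory[:-2])
    return batch_size <= 2 and num_layers <= 8 and memory_gb <= 26


found = find_fitting_config(fake_run, [8, 4, 2, 1], [16, 8], [28, 26])
print(found)
```

If the search returns None even for the smallest candidates, that matches the situation described above, where the shortage is on the HCCL side rather than in the memory managed by MindSpore.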