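1. System Environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2
Execution mode: static graph
Python version: 3.8
Operating system platform: Linux
2. Error Information
2.1 Problem Description
Running the Pangu model on MindSpore fails with: Failed to allocate memory. Possible Cause: Available memory is insufficient.
2.2 Error Message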
- Ascend Error Message:
----------------------------------------------------
EL0004: Failed to allocate memory.
Possible Cause: Available memory is insufficient.
or:
The Free Device Memory Size is 28.2318 GB, variable_memory_max_size/max_device_memory should be in range (0-28909.4]MB, but got 30720MB, please set the context key 'variable_memory_max_size'/'max_device_memory' in valid range.
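The bounds in the second message can be reproduced by hand: 28.2318 GB of free device memory is about 28909.4 MB, while the rejected 30720 MB corresponds to a requested 30 GB. The sketch below only illustrates that arithmetic; `gb_to_mb` and `fits_free_memory` are hypothetical helpers, not MindSpore APIs.

```python
def gb_to_mb(gb: float) -> float:
    """Convert GB to MB using the 1024 factor implied by the error message."""
    return gb * 1024


def fits_free_memory(requested_gb: float, free_gb: float) -> bool:
    """Return True if the request lies in the range (0, free memory]."""
    return 0 < gb_to_mb(requested_gb) <= gb_to_mb(free_gb)


free_gb = 28.2318  # "Free Device Memory Size" reported in the message

print(round(gb_to_mb(free_gb), 1))    # 28909.4
print(gb_to_mb(30))                   # 30720
print(fits_free_memory(30, free_gb))  # False
print(fits_free_memory(28, free_gb))  # True
```

Under this reading, any value above roughly 28.2 GB is rejected on this device.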
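3. Root Cause Analysis
1) Ask the cloud platform to upgrade the driver to 23.0rc3 or later (remember to keep the versions of the accompanying packages matched). This can reduce the device memory occupied by HCCL.
2) Modify the context in the main Python script, or change the max_device_memory setting in the mindformers yaml configuration file. Try lowering the device memory allocated to MindSpore so that more is left for HCCL and the driver.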
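from mindspore import context
context.set_context(max_device_memory="28GB")  # Ascend 910 device memory setting; on B4, 26GB is also worth trying
context.set_context(max_device_memory="57GB")  # Ascend 910 device memory setting; on B3, try 57GB or similar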
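One way to pick the value is to start from the card's total device memory and subtract a reserve for HCCL and the driver. The sketch below is a minimal illustration of that idea. `device_memory_arg` and `apply_memory_limit` are hypothetical helpers, and the 32 GB and 64 GB card sizes and the reserve amounts are assumptions chosen only so that the results line up with the values above. It also assumes that `max_device_memory` expects a string in the "xxGB" form.

```python
import re

# Assumed shape of the value: a positive number followed by "GB", e.g. "28GB" or "3.5GB".
_GB_PATTERN = re.compile(r"^(?:[1-9][0-9]*(?:\.[0-9]+)?|0\.[0-9]+)GB$")


def device_memory_arg(total_gb: float, reserve_gb: float) -> str:
    """Build a max_device_memory string that leaves reserve_gb to HCCL/driver."""
    budget = total_gb - reserve_gb
    if budget <= 0:
        raise ValueError("reserve_gb must be smaller than total_gb")
    text = f"{budget:g}GB"
    if not _GB_PATTERN.match(text):
        raise ValueError(f"unexpected format: {text}")
    return text


def apply_memory_limit(value: str) -> None:
    """Pass the limit to MindSpore's context (requires MindSpore to be installed)."""
    from mindspore import context  # imported lazily so the helper above works without MindSpore
    context.set_context(max_device_memory=value)


# Assumed card sizes and reserves (not taken from the original post).
print(device_memory_arg(32, 4))  # 28GB
print(device_memory_arg(64, 7))  # 57GB
```

If allocation still fails, increasing the reserve step by step (for example from 4 to 6) matches the "try 26GB" suggestion in the comment above.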
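3) Add the following environment variables to the launch script:
echo "--------------npu-smi info-----------"
npu-smi info
export MS_DEV_CELL_REUSE=1 # enable cell reuse
Alternatively, add only the following two environment variables; this worked when tried on pangu-sigma 38b.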
export MS_ENABLE_FORMAT_MODE=1
export MS_MEMORY_STATISTIC=1
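When the job is launched from a Python entry point rather than a shell script, the same variables can be set in the process environment. This is a sketch under the assumption that MindSpore reads these variables from the environment of the running process, so they are set before `mindspore` is imported. `export_env` is a hypothetical helper.

```python
import os

# The two alternative sets of variables listed above.
CELL_REUSE_ENV = {"MS_DEV_CELL_REUSE": "1"}
ALTERNATIVE_ENV = {"MS_ENABLE_FORMAT_MODE": "1", "MS_MEMORY_STATISTIC": "1"}


def export_env(variables: dict) -> None:
    """Set each variable in the current process environment."""
    for name, value in variables.items():
        os.environ[name] = value


# Pick one of the two sets; here the alternative pair is used.
export_env(ALTERNATIVE_ENV)

# Import mindspore only after this point (assumption: the variables
# need to be present before the framework initializes).
for name in ALTERNATIVE_ENV:
    print(name, "=", os.environ[name])
```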
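If none of the methods above help, keep lowering the batch size (bs), the number of network layers, and max_device_memory. Doing so showed that the operator that finally fails runs out of device memory on the HCCL side: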
E19999: Inner Error!
E19999 Call ops_kernel_info_store LoadTask fail[FUNC:Distribute][FILE:hccl_task_info.cc][LINE:320]
TraceBack (most recent call last):
Call ops_kernel_info_store UnloadTask fail[FUNC:Release][FILE:hccl_task_info.cc][LINE:657]
GraphManager RunGraphWithStreamAsync failed, session id = 0, graph id = 286, stream = 0xaaaae871edd0.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:516]
[Run][Graph]Run graph with stream asyn failed, error code -1343225860, session id = 0, graph id = 286, stream = 0xaaaae871edd0.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:774]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)