【案例】训练过程中评测超时导致训练过程发生中断

Hanshize · 2025 年9 月 24 日 03:08

【问题描述】

客户反馈4机32rank训练过程中发生中断，需要分析原因。
训练打屏日志提示：

[ERROR] DEVICE(192563,ffff90722010,python):2025-03-18-20:50:24.716.326 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:241] SyncStream] Call runtime aclrtSynchronizeStreamWithTimeout error.
\----------------------------------------------------
\- Ascend Error Message:
\----------------------------------------------------
EE9999: Inner Error!
EE9999 The error from device(chipId:3, dieId:0), serial number is 1, event wait timeout occurred during task execution, stream_id:122, sq_id:122, task_id:790, event_id=73, timeout=1980.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1314]
TraceBack (most recent call last):
Task execute failed, device_id=3, stream_id=122, task_id=790, flip_num=0, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1483]
rtStreamSynchronizeWithTimeout execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
synchronize stream failed, runtime result = 507011[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

【版本信息】

MindSpore 2.3.0 RC1
Python : 3.7

【根因】

客户的模型在训练一定数量的epoch后，会进行权重文件的评测，期间各个卡的通信停止，
HCCL_EXEC_TIMEOUT默认值1836，如果评测时间超过1836s，就会导致各卡的HCCL连接发生超时。

【处理过程】

客户采用4机32rank来进行训练，分析plog日志，可以看到每个服务器的后7个rank的plog日志都打印类似的错误：

ReportErrorInfoForModelExecuteTask:model execute error, retCode=0x91, [the model stream execute failed].
ProcessStarsWaitTimeoutErrorInfo:The error from device(chipId:2, dieId:0), serial number is 1, event wait timeout occurred during task execution, stream_id:122, sq_id:122, task_id:790, event_id=73, timeout=1980.
ProcessStarsWaitTimeoutErrorInfo:report error module_type=6, module_name=EE9999

判断并不是某个单点的rank的问题导致的问题，进一步分析训练日志。
从训练日志中可以看到在第170个epoch结束后，开始进行评测任务（客户自己定制的特性）

然后评测未完全完成时发生错误

从评测开始到发生错误，时间差不多是30分钟。

刚好在在plog报错前的日志对应的时间段可以看到下面的记录，也是在首次打印ReportTimeoutProc: report timeout!的地方后30分钟发生notify wait的timeout错误。
这个时间也与下面错误信息中的timeout=1980基本吻合

EE9999 The error from device(chipId:3, dieId:0), serial number is 1, event wait timeout occurred during task execution, stream_id:122, sq_id:122, task_id:790, event_id=73, timeout=1980.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1314]

Notify wait的错误是因为HCCL超时，超时时间是通过HCCL_CONNECT_TIMEOUT和HCCL_EXEC_TIMEOUT来控制的。检查运行环境的HCCL_CONNECT_TIMEOUT和HCCL_EXEC_TIMEOUT配置，运行环境变量中并没有HCCL_CONNECT_TIMEOUT(默认时间60)和HCCL_EXEC_TIMEOUT(默认时间1800（秒）)。
基于上面的分析，原因判断为评测时间超过了HCCL的超时时间。

【解决方案】

配置HCCL_CONNECT_TIMEOUT和HCCL_EXEC_TIMEOUT参数，增加超时时间，减少因为评测任务执行时间长导致的超时失败

话题	回复	浏览量
Ascend多卡训练报错davinci_model : load task fail, return ret xxx 分布式并行-Distributed Parallelsim	47	2025 年7 月 25 日
微调qwen3-32B大模型，单机多卡信号同步失败 Sync run failed及解决调优经验-Tuning Experience	42	2025 年11 月 30 日
微调qwen3-32B大模型，单机多卡信号同步失败 Sync run failed 分布式并行-Distributed Parallelsim	64	2025 年11 月 21 日
MindSpore分布式8节点报错Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295 分布式并行-Distributed Parallelsim	33	2025 年9 月 24 日
[LoadTask] Distribute Task Failed 报错解决模型训练-Model Training	43	2025 年9 月 11 日