moxing copy timeout on Ascend causes a Notify timeout

Skyti · October 4, 2025, 12:47

1. Environment

Hardware (Ascend/GPU/CPU): Ascend 910B
MindSpore version: 2.2.0/2.1.0
Execution mode (PyNative/Graph): Graph
MF version: r1.0
Python version: any
Operating system: Linux

Error message

- Ascend Error Message:
EE9999: Inner Error!
EE9999 The error from device(chipId:0, dieId:0), serial number is 1, notify wait timeout occurred during task execution, stream_id:28, sq_id:28, task_id:2, notify_id=5, timeout=1836.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1308]
	TraceBack (most recent call last):
	Notify wait execute failed, device_id=0, stream_id=28, task_id=2, flip_num=0, notify_id=5[FUNC:GetError][FILE:stream.cc][LINE:1483]
	rtStreamSynchronizeWithTimeout execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
	synchronize stream failed, runtime result = 507011[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
- C++ Call Stack: (For framework developers)
mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:679 Run

2. Solution

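The notify timeout occurs partway through training. The logs show that the error is raised after the line "start upload output file to obs". According to the mindformers code, that log line marks the upload of ckpt and other output files to OBS through moxing. Further investigation showed that the upload to OBS took too long and exceeded the HCCL wait time, which blocked network training and caused the notify wait timeout.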
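To confirm that a blocking step such as the upload is what eats into the wait time, one option is to time it and compare the result against the timeout printed in the log. The following is only a minimal sketch: `timed_step` and its parameters are hypothetical helper names, not part of mindformers or moxing, and the 1836 s budget is taken from the `timeout=1836` field in the error message above.

```python
import time
from contextlib import contextmanager

# Assumption: this is the timeout value (in seconds) printed in the error log.
NOTIFY_TIMEOUT_S = 1836


@contextmanager
def timed_step(name, budget_s=NOTIFY_TIMEOUT_S, warn_ratio=0.5, report=print):
    """Time a blocking step and report when it uses a large share of the budget.

    Hypothetical helper for diagnosis only. It measures elapsed wall-clock
    time; it does not cancel or shorten the step.
    """
    info = {"name": name, "elapsed_s": None, "over_budget": False}
    start = time.monotonic()
    try:
        yield info
    finally:
        elapsed = time.monotonic() - start
        info["elapsed_s"] = elapsed
        info["over_budget"] = elapsed >= budget_s
        # Warn once the step has consumed warn_ratio of the budget or more.
        if elapsed >= budget_s * warn_ratio:
            report(f"[warn] {name} took {elapsed:.1f}s (budget {budget_s}s)")


# Usage: wrap the blocking call that is suspected of being slow.
# with timed_step("upload output files") as info:
#     upload_outputs()  # hypothetical function standing in for the upload
```

If the measured time is close to or above the budget, the other ranks waiting in collective communication are likely to hit the notify wait timeout, which matches the symptom described here.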

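Solution:
1. Push the Yundao (云道) platform team to investigate why moxing uploads are so slow.
Workaround: do not call moxing manually. Instead, change the ckpt save path to an output path supported by Yundao, so that files are uploaded automatically.
step1: Edit the mindformers config file and delete ObsMonitor.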

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "llama_70b"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False
  - type: ObsMonitor  # delete this entry
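If the config is edited programmatically rather than by hand, the same change can be sketched on the parsed config. This is an illustration only: `drop_callback` is a hypothetical helper, and the config is represented as a plain Python dict that mirrors the callbacks section above (how the YAML is loaded and written back is left out).

```python
import copy


def drop_callback(config, callback_type):
    """Return a copy of config without callbacks whose 'type' matches.

    Hypothetical helper: it operates on a plain dict and leaves the
    original config object untouched.
    """
    new_config = copy.deepcopy(config)
    new_config["callbacks"] = [
        cb for cb in new_config.get("callbacks", [])
        if cb.get("type") != callback_type
    ]
    return new_config


# Dict form of the callbacks section, shortened for the example.
config = {
    "callbacks": [
        {"type": "MFLossMonitor"},
        {"type": "CheckpointMointor", "save_checkpoint_steps": 1000},
        {"type": "ObsMonitor"},
    ]
}

cleaned = drop_callback(config, "ObsMonitor")
print([cb["type"] for cb in cleaned["callbacks"]])
# prints ['MFLossMonitor', 'CheckpointMointor']
```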

step2: Change the default ckpt save path. In mindformers/tools/utils.py, change MA_OUTPUT_PATH to /home/ma-user/modelarts/outputs/train_url_0. The default setting in that file is:

MA_OUTPUT_ROOT = '/cache/ma-user-work'

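This only works if an output path was set when the job was launched. Once it is set, files generated during training are automatically uploaded to the OBS path configured for output data, so file upload is decoupled from model training.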
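Because the workaround depends on the output path being configured, it can help to check the directory before training starts. The sketch below assumes that the platform creates the output directory inside the container when an output path is set; that behaviour is an assumption, not something stated in this post, and `output_dir_ready` is a hypothetical helper.

```python
import os
import tempfile

# Path from step2. Assumption: the platform creates it when an output
# path is configured for the job.
MA_OUTPUT_DIR = "/home/ma-user/modelarts/outputs/train_url_0"


def output_dir_ready(path):
    """Return True if path is an existing directory where files can be created."""
    if not os.path.isdir(path):
        return False
    try:
        # Create and immediately remove a temporary file to test write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    if output_dir_ready(MA_OUTPUT_DIR):
        print(f"{MA_OUTPUT_DIR} is ready for checkpoints")
    else:
        print(f"{MA_OUTPUT_DIR} is missing or not writable; "
              "check the job's output path setting")
```

Failing early here is cheaper than discovering after many training steps that checkpoints were written to a location that is never uploaded.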
