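moxing copy timeout on Ascend causes a Notify wait timeout

Environment

  • Hardware environment (Ascend/GPU/CPU): Ascend 910B
  • MindSpore version: 2.2.0/2.1.0
  • Execution mode (PyNative/Graph): Graph
  • Python version: any
  • Operating system: Linux

Error message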

- Ascend Error Message:
EE9999: Inner Error!
EE9999 The error from device(chipId:0, dieId:0), serial number is 1, notify wait timeout occurred during task execution, stream_id:28, sq_id:28, task_id:2, notify_id=5, timeout=1836.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1308]
	TraceBack (most recent call last):
	Notify wait execute failed, device_id=0, stream_id=28, task_id=2, flip_num=0, notify_id=5[FUNC:GetError][FILE:stream.cc][LINE:1483]
	rtStreamSynchronizeWithTimeout execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
	synchronize stream failed, runtime result = 507011[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
- C++ Call Stack: (For framework developers)
mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:679 Run

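Cause analysis

  • A notify wait timed out partway through training. The log shows that the error appears right after the line "start upload output file to obs". In the mindformers code, this log line is printed when moxing uploads ckpt files and other outputs to OBS. Further investigation showed that the OBS upload took too long and exceeded the HCCL wait time (timeout=1836 in the error above). The upload blocked network training, so the notify wait timed out.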
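
The timing relationship behind this failure can be sketched in a few lines of Python: measure how long a blocking step takes and compare it with the wait budget. This is only an illustrative sketch. The helper names are hypothetical, the default of 1836 seconds is taken from the `timeout=1836` field in the error log, and reading the budget from the `HCCL_EXEC_TIMEOUT` environment variable is an assumption about how the job is configured.

```python
import os
import time

# Assumption: 1836 s mirrors the `timeout=1836` value in the error log.
DEFAULT_NOTIFY_TIMEOUT_S = 1836


def notify_timeout_seconds(env=None):
    """Return the wait budget in seconds, falling back to the logged default."""
    env = os.environ if env is None else env
    # Assumption: HCCL_EXEC_TIMEOUT is the variable that controls this wait.
    raw = env.get("HCCL_EXEC_TIMEOUT", "")
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return DEFAULT_NOTIFY_TIMEOUT_S


def timed_call(func, *args, **kwargs):
    """Run a blocking call and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    return result, time.monotonic() - start


def exceeds_budget(elapsed_s, budget_s, safety_ratio=0.8):
    """True when a blocking step used more than safety_ratio of the budget."""
    return elapsed_s > budget_s * safety_ratio
```

Wrapping the upload call with `timed_call` and logging a warning when `exceeds_budget` returns True would make a slow upload visible before other devices give up waiting.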

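Solution:

1. Push the 云道 (Yundao) platform team to investigate why the moxing upload is so slow.

Workaround: do not call moxing manually to upload. Instead, change the ckpt save path to an output path supported by 云道, so that the files are uploaded automatically.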

step1: Modify the mindformers config file and delete ObsMonitor

# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "llama_70b"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False
  - type: ObsMonitor  # delete this entry
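
If the config is processed by a script instead of edited by hand, step1 amounts to filtering one entry out of the callbacks list. The following is a minimal sketch on already-parsed data (plain Python dicts). The helper `drop_callback` is hypothetical and not part of mindformers, and the type names are spelled exactly as they appear in the config.

```python
def drop_callback(callbacks, type_name="ObsMonitor"):
    """Return a new callbacks list without entries of the given type."""
    return [cb for cb in callbacks if cb.get("type") != type_name]


# Mirrors the callbacks section of the config, expressed as parsed data.
callbacks = [
    {"type": "MFLossMonitor"},
    {
        "type": "CheckpointMointor",
        "save_checkpoint_steps": 1000,
        "integrated_save": False,
        "async_save": False,
    },
    {"type": "ObsMonitor"},
]

# The original list is left untouched; only the returned copy is filtered.
cleaned = drop_callback(callbacks)
print([cb["type"] for cb in cleaned])
```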

step2: Change the default ckpt save path. In mindformers/tools/utils.py, change MA_OUTPUT_ROOT to /home/ma-user/modelarts/outputs/train_url_0. The default value is:

MA_OUTPUT_ROOT = '/cache/ma-user-work'

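This workaround only works if an output path was set when the job was launched. Once it is set, files generated during training are uploaded automatically to the OBS path configured for output data, and file upload is decoupled from the model training process.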
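
Because the workaround depends on the output path being configured at launch, a small guard can check for the directory before training writes to it. This is a sketch under assumptions: `resolve_output_root` and the `./output` fallback are hypothetical, and treating "the directory exists and is writable" as evidence that an output path was configured is a guess about the platform's behaviour, not something the source confirms.

```python
import os

# Path quoted in step2. Assumption: it exists only when the job was
# launched with an output path configured.
MA_AUTO_UPLOAD_DIR = "/home/ma-user/modelarts/outputs/train_url_0"


def resolve_output_root(preferred=MA_AUTO_UPLOAD_DIR, fallback="./output"):
    """Return `preferred` if it is a writable directory, else `fallback`."""
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return fallback


if __name__ == "__main__":
    # Prints the directory that checkpoints would be written to.
    print(resolve_output_root())
```

Falling back to a local directory keeps a misconfigured job running, but files written there are not covered by the automatic upload, so logging which branch was taken is advisable.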