Environment information
- Hardware (Ascend/GPU/CPU): Ascend 910B
- MindSpore version: 2.2.0/2.1.0
- Execution mode (PyNative/Graph): Graph
- Python version: any
- OS platform: Linux
Error message
- Ascend Error Message:
EE9999: Inner Error!
EE9999 The error from device(chipId:0, dieId:0), serial number is 1, notify wait timeout occurred during task execution, stream id:28, sq id:28, task_id:2, notify_id=5, timeout=1836.[FUNC:ProcessStarsWaitTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1308]
TraceBack (most recent call last)
Notify wait execute failed, device id=0, stream_id=28, task_id=2, flip_num=0, notify_id=5[FUNC:GetError][FILE:stream.cc][LINE:1483]
rtStreamSynchronizeWithTimeout execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
synchronize stream failed, runtime result = 507011[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
- C++ Call Stack: (For framework developers)
mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:679 Run
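When triaging a log like the one above, it can help to pull the numeric fields out of the timeout line. The following is only a minimal sketch: the function name `parse_notify_timeout` is hypothetical, and the regular expression assumes the field layout shown in this particular message.

```python
import re

# Pattern assumes the field layout of the EE9999 line quoted above.
# "stream id" is matched with either a space or an underscore.
_NOTIFY_TIMEOUT_RE = re.compile(
    r"stream[ _]id:(?P<stream_id>\d+).*?"
    r"task_id:(?P<task_id>\d+).*?"
    r"notify_id=(?P<notify_id>\d+).*?"
    r"timeout=(?P<timeout>\d+)"
)


def parse_notify_timeout(line):
    """Return the numeric fields of a notify wait timeout line, or None."""
    match = _NOTIFY_TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "EE9999 The error from device(chipId:0, dieId:0), serial number is 1, "
    "notify wait timeout occurred during task execution, stream id:28, "
    "sq id:28, task_id:2, notify_id=5, timeout=1836."
)
print(parse_notify_timeout(sample))
```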
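Solution
- A notify wait timed out partway through the run. The log shows that the error occurs right after `start upload output file to obs`. According to the mindformers code, this log line is printed when moxing uploads ckpt and other output files to OBS. Further investigation showed that the upload to OBS took too long and exceeded the HCCL wait time, which blocked network training and caused the notify wait timeout.
Resolution:
1. Ask the Yundao (云道) platform team to investigate why the moxing upload is so slow.
Workaround: do not call moxing manually to upload. Instead, change the ckpt save path to an output path supported by Yundao so that the files are uploaded automatically.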
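The diagnosis above comes down to a blocking step that runs longer than the wait budget. One way to confirm this is to time the step. This is a rough sketch only: `timed_call` and `fake_upload` are hypothetical helpers, `fake_upload` merely stands in for the real upload call, and 1836 is the `timeout` value from the error message, assumed here to be in seconds.

```python
import time

# Taken from the error message; the unit is assumed to be seconds.
NOTIFY_TIMEOUT_S = 1836


def timed_call(fn, *args, budget_s=NOTIFY_TIMEOUT_S, **kwargs):
    """Run fn and return (result, elapsed_seconds, exceeded_budget)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    return result, elapsed, elapsed > budget_s


def fake_upload(src, dst):
    # Hypothetical stand-in for the real upload; it only sleeps briefly.
    time.sleep(0.05)
    return f"{src} -> {dst}"


# A tiny budget is used so the sketch shows the "exceeded" case quickly.
result, elapsed, exceeded = timed_call(
    fake_upload, "/tmp/ckpt", "obs://bucket/ckpt", budget_s=0.01
)
print(result, exceeded)
```

If the measured time of the real upload approaches the budget, the other ranks are likely to hit the same notify wait timeout while they wait.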
Step 1: Edit the mindformers config file and delete ObsMonitor
# callbacks
callbacks:
  - type: MFLossMonitor
  - type: CheckpointMointor
    prefix: "llama_70b"
    save_checkpoint_steps: 1000
    integrated_save: False
    async_save: False
  - type: ObsMonitor  # delete this entry
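If the config is edited programmatically rather than by hand, the same removal can be sketched as a filter over the parsed callbacks list. This is only an illustration: `drop_callback` is a hypothetical helper, and the list below is a hand-written, shortened stand-in for the parsed `callbacks` section.

```python
def drop_callback(callbacks, callback_type):
    """Return a new callbacks list without entries of the given type."""
    return [cb for cb in callbacks if cb.get("type") != callback_type]


# Hand-written stand-in for the parsed callbacks section of the config.
callbacks = [
    {"type": "MFLossMonitor"},
    {"type": "CheckpointMointor", "save_checkpoint_steps": 1000},
    {"type": "ObsMonitor"},
]

cleaned = drop_callback(callbacks, "ObsMonitor")
print([cb["type"] for cb in cleaned])
```

The helper returns a new list and leaves the input unchanged, so the original config can still be compared against the cleaned one.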
Step 2: Change the default ckpt save path. In mindformers/tools/utils.py, change the output path (MA_OUTPUT_PATH) to /home/ma-user/modelarts/outputs/train_url_0. The default value is:
MA_OUTPUT_ROOT = '/cache/ma-user-work'
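This only works if an output path was set when the job was launched. Once it is set, files generated during training are uploaded automatically to the OBS path for output data, so file upload is decoupled from the model training process.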
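The precondition above can be checked before training starts. The sketch below assumes that the platform creates the local output directory when an output path is configured, which this note does not confirm; `output_dir_ready` is a hypothetical helper.

```python
import os
import tempfile


def output_dir_ready(path):
    """True if path is an existing directory this process can write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


# On the training node this would be the path from step 2; a temporary
# directory is used here so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    existing_ok = output_dir_ready(tmp)
    missing_ok = output_dir_ready(os.path.join(tmp, "missing"))

print(existing_ok, missing_ok)
```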