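1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend910
MindSpore version: any
Execution mode (PyNative/Graph): any
Python version: any
Operating system platform: Linux
2. Error message
2.1 Problem description and error
The forward pass of the model fails with the following error: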
[ERROR] [2023-11-27 13:40:37] rank 49: Traceback (most recent call last):
File "/home/ma-user/code/pangu_nlp/src/obs.py", line 467, in wrapped_func
run_func(cloud_adapter) # main function
File "/home/ma-user/code//pangu_nlp/train.py", line 154, in run_pangu_train
model.train(actual_sink_steps, train_ds, callbacks=callbacks, sink_size=args.sink_size, dataset_sink_mode=True)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/train/model.py", line 1068, in train
self._train(epoch,
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/train/model.py", line 114, in wrapper
func(self, *args, **kwargs)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/train/model.py", line 623, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback,
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/nn/cell.py", line 680, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1023, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/common/api.py", line 1589, in __call__
return self.run(obj, *args, phase=phase)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/common/api.py", line 1628, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/common/api.py", line 121, in wrapper
results = fn(*arg, **kwargs)
File "/home/ma-user/anaconda3/envs/python-3.9/lib/python3.9/site-packages/mindspore/common/api.py", line 1608, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Sync stream failed:Ascend_0
Error messages in the plog:
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.557 [device_error_proc.cc:1138]666400 ProcessStarsCoreErrorInfo:The error from device(chipId:0, dieId:0), serial number is 14, there is an aivec error exception, core id is 46, error code = 0x800000, dump info: pc start: 0x1240c001b000, current: 0x1240c001b4f0, vec error info: 0xd1070f2790, mte error info: 0x6c030655c8, ifu error info: 0x37fb88b75f040, ccu error info: 0xdb3b4c902c8d5dff, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x1240c00bf000.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.584 [device_error_proc.cc:1150]666400 ProcessStarsCoreErrorInfo:report error module_type=5, module_name=EZ9999
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.589 [device_error_proc.cc:1150]666400 ProcessStarsCoreErrorInfo:The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x30655c8, fixp_error1 info: 0x6c fsmId:0, tslot:0, thread:0, ctxid:0, blk:17, sublk:0, subErrType:4.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.606 [stream.cc:3116]666400 EnterFailureAbort:stream_id=2 enter failure abort.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.647 [stream.cc:1480]666400 GetError:Stream Synchronize failed, stream_id=2, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.656 [task_info.cc:4737]666400 ReportErrorInfoForModelExecuteTask:model execute error, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.674 [task_info.cc:4704]666400 PrintErrorInfoForModelExecuteTask:stream_id=2, task_id=3, write_sq_id=5, read_sq_id=5, fsm_state=0, sqVirtualAddr=3884417024, head equal tail flag=0.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.682 [task_info.cc:4710]666400 PrintErrorInfoForModelExecuteTask:model execute task failed, device_id=0, model stream_id=2, model task_id=3, flip_num=0, model_id=0, first_task_id=0
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.775 [task_info.cc:1574]666400 PrintErrorInfoForDavinciTask:Aicore kernel execute failed, device_id=0, stream_id=5, report_stream_id=2, task_id=0, flip_num=0, fault kernel_name=00_1_Default/Gather-op177, program id=0, hash=12989916100239791524.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.814 [task_info.cc:1516]666400 GetArgsInfo:[AIC_INFO] args(0 to 4) after execute:0x124ee3e00000, 0x124f7ac00e00, 0x124f79e00200, 0x124180000000,
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.823 [task_info.cc:1519]666400 GetArgsInfo:print 1 Times totalLen=(4*8)Bytes, argsSize=32
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.834 [task_info.cc:1578]666400 PrintErrorInfoForDavinciTask:[AIC_INFO] after execute:args print end
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.860 [engine.cc:3646]666400 StarsResumeRtsq:stop scheduling in abort failure mode: stream_id=2, sq_id=2, sq_head=4, task_id=4, taskType=14.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.869 [engine.cc:3478]666400 SyncTask:stream is in failure abort and need to reclaim all, stream_id=2.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.880 [stream.cc:1480]666400 GetError:Stream Synchronize failed, stream_id=2, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.885 [stream.cc:1483]666400 GetError:report error module_type=5, module_name=EZ9999
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.890 [stream.cc:1483]666400 GetError:Aicore kernel execute failed, device_id=0, stream_id=5, report_stream_id=2, task_id=0, flip_num=0, fault kernel_name=00_1_Default/Gather-op177, program id=0, hash=12989916100239791524.
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.906 [stream.cc:1483]666400 GetError:report error module_type=5, module_name=EZ9999
[ERROR] RUNTIME(665015,python3):2023-11-27-14:49:47.448.911 [stream.cc:1483]666400 GetError:[AIC_INFO] after execute:args print end
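3. Root cause analysis
The plog shows that the Gather operator failed to execute (fault kernel_name=00_1_Default/Gather-op177) and reports "The DDR address of the MTE instruction is out of range", i.e. an out-of-bounds memory access. For Gather, this usually means that an index is out of range. In NLP models, Gather is used for the embedding lookup on the input, so the suspicion was that the input data contained token ids beyond vocabsize. Feeding the model an all-ones input of the same shape let it run normally, which pointed to the data rather than the network. Checking the data confirmed that the vocabsize parameter did not match the dataset: it was smaller than the dataset's actual vocabsize.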
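The data check described above can be sketched as follows. This is a minimal illustration with NumPy, not the script used in the original investigation: the function name `find_out_of_range_ids`, the sample batches, and the value of `vocab_size` are all hypothetical. It assumes each batch of token ids can be converted to an integer array.

```python
import numpy as np


def find_out_of_range_ids(batches, vocab_size):
    """Return (batch_index, min_id, max_id) for each batch that contains
    ids outside the valid embedding range [0, vocab_size)."""
    bad = []
    for i, batch in enumerate(batches):
        ids = np.asarray(batch)
        lo, hi = int(ids.min()), int(ids.max())
        if lo < 0 or hi >= vocab_size:
            bad.append((i, lo, hi))
    return bad


# Hypothetical data: the second batch contains id 9, but vocab_size is 8.
vocab_size = 8
batches = [
    np.array([[1, 2, 3], [4, 5, 7]]),
    np.array([[0, 9, 2], [3, 3, 3]]),
]

bad_batches = find_out_of_range_ids(batches, vocab_size)
print(bad_batches)  # [(1, 0, 9)]

# All-ones input with the same shape as the failing batch. Every id is
# in range, so it isolates the data as the cause if the model runs with it.
ones_input = np.ones_like(batches[1])
```

With a real dataset, the same function could be applied to the token-id column of each batch before it is sent to the device.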
4. Solution
Use a vocabsize that matches the value range of the dataset, so that every token id is smaller than vocabsize.
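One way to catch this mismatch before training starts is a small guard that derives the minimum vocabulary size from the data and compares it with the configured value. The sketch below is an assumption-laden example: `required_vocab_size`, `check_vocab_size`, and the sample batches are hypothetical names and data, and it assumes token ids are non-negative integers.

```python
import numpy as np


def required_vocab_size(batches):
    """Smallest vocab size that covers every id: largest id + 1."""
    return max(int(np.asarray(b).max()) for b in batches) + 1


def check_vocab_size(configured, batches):
    """Raise on the host if the configured vocab size is too small,
    instead of letting Gather fail on the device."""
    needed = required_vocab_size(batches)
    if configured < needed:
        raise ValueError(
            f"vocab size {configured} is too small: the data needs at least {needed}"
        )
    return needed


# Hypothetical token-id batches; the largest id is 9.
batches = [np.array([[1, 2, 3]]), np.array([[0, 9, 2]])]

print(required_vocab_size(batches))  # 10
check_vocab_size(10, batches)        # passes
try:
    check_vocab_size(8, batches)     # too small for id 9
except ValueError as err:
    print(err)
```

Failing on the host with a clear message is much easier to diagnose than the device-side "Sync stream failed" error.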