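I am fine-tuning qwen3-32B on the MindSpore (昇思) large model platform. On the first run, it reported that the loss node could not be found. After I restarted the platform in the afternoon, that morning error no longer appeared, but the run now fails with a multi-card signal synchronization error. Details are as follows: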
[ERROR] DEVICE(11430,ffffa784a020,python):2025-10-31-10:37:27.541.835 [mindspore/ccsrc/plugin/ascend/res_manager/hal_manager/ascend_err_manager.cc:161] TaskExceptionCallback] Run Task failed, task_id: 20, stream_id: 2, tid: 11430, device_id: 1, retcode: 507048 (Return error code unknown, ret code: 507048)
[ERROR] DEVICE(11430,ffffa784a020,python):2025-10-31-10:37:27.659.539 [mindspore/ccsrc/plugin/ascend/res_manager/stream_manager/ascend_stream_manager.cc:301] SyncStream] Has set launch blocking, but synchronous stream still failed. Please save the complete log information to further identify the specific error cause.
Traceback (most recent call last):
  File "/home/mindspore/work/mindformers/run_mindformer.py", line 341, in <module>
    main(config_)
  File "/home/mindspore/work/mindformers/run_mindformer.py", line 71, in main
    trainer.train()
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/_checkparam.py", line 1398, in wrapper
    return func(*args, **kwargs)
  File "/home/mindspore/work/mindformers/mindformers/trainer/trainer.py", line 491, in train
    self.trainer.train(
  File "/home/mindspore/work/mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 108, in train
    self.training_process(
  File "/home/mindspore/work/mindformers/mindformers/trainer/base_trainer.py", line 1243, in training_process
    config.load_checkpoint = get_load_path_after_hf_convert(config, network)
  File "/home/mindspore/work/mindformers/mindformers/utils/load_checkpoint_utils.py", line 85, in get_load_path_after_hf_convert
    converted_sf_path = process_hf_checkpoint(network, config.output_dir, config.load_checkpoint)
  File "/home/mindspore/work/mindformers/mindformers/utils/load_checkpoint_utils.py", line 510, in process_hf_checkpoint
    barrier_world("for the main rank to convert HuggingFace weight…")
  File "/home/mindspore/work/mindformers/mindformers/tools/utils.py", line 748, in barrier_world
    comm_func.barrier()
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/communication/comm_func.py", line 896, in barrier
    _op()
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 397, in __call__
    return _run_op(self, self.name, args)
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 1006, in _run_op
    res = _pynative_executor.run_op_async(obj, op_name, args)
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/common/api.py", line 1672, in run_op_async
    return self._executor.run_op_async(*args)
RuntimeError: Sync run failed, detail: EI0002: [PID: 11430] 2025-10-31-10:37:27.543.248 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[2], taskID[20], tag[AllReduce_192.168.129.62%eth0_60000_0_1761905137161718], AlgType(level 0-1-2):[fullmesh-ring-NHR].]. task information: [
there are(is) 1 abnormal device(s):
The error in schedule.log is: RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.
The other cards report: [ERROR] DEVICE(11432,ffffafc5b020,python):2025-10-31-10:37:27.933.831 [mindspore/ccsrc/plugin/ascend/res_manager/stream_manager/ascend_stream_manager.cc:345] SyncAllStreams] Has set launch blocking, but synchronous stream still failed. Please save the complete log information to further identify the specific error cause.
[ERROR] ME(11432,ffffafc5b020,python):2025-10-31-10:37:27.933.896 [mindspore/ccsrc/runtime/hardware_abstract/device_context/device_context_manager.cc:521] WaitTaskFinishOnDevice] SyncStream failed
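Since the schedule.log message says to check the log of the first worker that timed out, the worker IDs can be pulled out of that message programmatically. The following is only a minimal sketch: the regular expression is an assumption based on the single schedule.log line quoted above, the comma separator for multiple workers is a guess, and `timed_out_workers` is a hypothetical helper, not part of MindSpore or MindFormers.

```python
import re

# Assumed format, taken from the one schedule.log line quoted in this post:
#   "Timed out node list is: [const vector]{1}"
_TIMED_OUT_RE = re.compile(r"Timed out node list is: \[const vector\]\{([^}]*)\}")


def timed_out_workers(log_text):
    """Return worker IDs reported as timed out, in order of first appearance."""
    workers = []
    for match in _TIMED_OUT_RE.finditer(log_text):
        # Assumption: several IDs, if present, are comma-separated.
        for token in match.group(1).split(","):
            token = token.strip()
            if token.isdigit() and int(token) not in workers:
                workers.append(int(token))
    return workers


sample = (
    "RuntimeError: The total number of timed out node is 1. "
    "Timed out node list is: [const vector]{1}, worker 1 is the first one "
    "timed out, please check its log."
)
print(timed_out_workers(sample))
```

Applied to the message above, this points at worker 1, which matches `device_id: 1` in the first DEVICE error, so that worker's own log is the one to look at first.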
