Fine-tuning Qwen3-32B on the MindSpore large model platform (昇思大模型平台): single-node multi-card signal synchronization fails with "Sync run failed"

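I am fine-tuning Qwen3-32B on the MindSpore large model platform. The first run reported that the loss node could not be found. After I restarted the platform in the afternoon, that error from the morning was gone, but the run now fails with a multi-card signal synchronization error. Details below: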

[ERROR] DEVICE(11430,ffffa784a020,python):2025-10-31-10:37:27.541.835 [mindspore/ccsrc/plugin/ascend/res_manager/hal_manager/ascend_err_manager.cc:161] TaskExceptionCallback] Run Task failed, task_id: 20, stream_id: 2, tid: 11430, device_id: 1, retcode: 507048 (Return error code unknown, ret code: 507048)
[ERROR] DEVICE(11430,ffffa784a020,python):2025-10-31-10:37:27.659.539 [mindspore/ccsrc/plugin/ascend/res_manager/stream_manager/ascend_stream_manager.cc:301] SyncStream] Has set launch blocking, but synchronous stream still failed. Please save the complete log information to further identify the specific error cause.
Traceback (most recent call last):
  File "/home/mindspore/work/mindformers/run_mindformer.py", line 341, in <module>
    main(config_)
  File "/home/mindspore/work/mindformers/run_mindformer.py", line 71, in main
    trainer.train()
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/_checkparam.py", line 1398, in wrapper
    return func(*args, **kwargs)
  File "/home/mindspore/work/mindformers/mindformers/trainer/trainer.py", line 491, in train
    self.trainer.train(
  File "/home/mindspore/work/mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 108, in train
    self.training_process(
  File "/home/mindspore/work/mindformers/mindformers/trainer/base_trainer.py", line 1243, in training_process
    config.load_checkpoint = get_load_path_after_hf_convert(config, network)
  File "/home/mindspore/work/mindformers/mindformers/utils/load_checkpoint_utils.py", line 85, in get_load_path_after_hf_convert
    converted_sf_path = process_hf_checkpoint(network, config.output_dir, config.load_checkpoint)
  File "/home/mindspore/work/mindformers/mindformers/utils/load_checkpoint_utils.py", line 510, in process_hf_checkpoint
    barrier_world("for the main rank to convert HuggingFace weight…")
  File "/home/mindspore/work/mindformers/mindformers/tools/utils.py", line 748, in barrier_world
    comm_func.barrier()
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/communication/comm_func.py", line 896, in barrier
    _op()
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 397, in __call__
    return _run_op(self, self.name, args)
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 1006, in _run_op
    res = _pynative_executor.run_op_async(obj, op_name, args)
  File "/home/mindspore/miniconda/envs/jupyter/lib/python3.9/site-packages/mindspore/common/api.py", line 1672, in run_op_async
    return self._executor.run_op_async(*args)
RuntimeError: Sync run failed, detail: EI0002: [PID: 11430] 2025-10-31-10:37:27.543.248 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank: [unknown]. base information: [streamID:[2], taskID[20], tag[AllReduce_192.168.129.62%eth0_60000_0_1761905137161718], AlgType(level 0-1-2):[fullmesh-ring-NHR].]. task information: [
there are(is) 1 abnormal device(s):

schedule.log reports: RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{1}, worker 1 is the first one timed out, please check its log.

The other cards report: [ERROR] DEVICE(11432,ffffafc5b020,python):2025-10-31-10:37:27.933.831 [mindspore/ccsrc/plugin/ascend/res_manager/stream_manager/ascend_stream_manager.cc:345] SyncAllStreams] Has set launch blocking, but synchronous stream still failed. Please save the complete log information to further identify the specific error cause.
[ERROR] ME(11432,ffffafc5b020,python):2025-10-31-10:37:27.933.896 [mindspore/ccsrc/runtime/hardware_abstract/device_context/device_context_manager.cc:521] WaitTaskFinishOnDevice] SyncStream failed
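
When reading logs like the ones above, it can help to pull the stream ID, task ID, and communication tag out of the EI0002 message and compare them with the device-side "Run Task failed" line. The following is only a sketch: the regular expression is written against the one message quoted in this post, and the function name `parse_hccl_base_info` is made up for illustration. Other HCCL versions may format the message differently.

```python
import re

# Excerpt copied from the RuntimeError quoted in this post.
ERROR_TEXT = (
    "base information: [streamID:[2], taskID[20], "
    "tag[AllReduce_192.168.129.62%eth0_60000_0_1761905137161718], "
    "AlgType(level 0-1-2):[fullmesh-ring-NHR].]"
)

# Assumption: the fields always appear in this order and with these brackets.
BASE_INFO = re.compile(
    r"streamID:\[(?P<stream>\d+)\],\s*"
    r"taskID\[(?P<task>\d+)\],\s*"
    r"tag\[(?P<tag>[^\]]+)\]"
)


def parse_hccl_base_info(text):
    """Return stream ID, task ID, tag, and collective name, or None if absent."""
    match = BASE_INFO.search(text)
    if match is None:
        return None
    tag = match.group("tag")
    return {
        "stream_id": int(match.group("stream")),
        "task_id": int(match.group("task")),
        "tag": tag,
        # The text before the first underscore names the collective operation.
        "collective": tag.split("_", 1)[0],
    }


info = parse_hccl_base_info(ERROR_TEXT)
print(info)
```

Here the extracted stream ID 2 and task ID 20 match the `stream_id: 2` and `task_id: 20` in the first DEVICE error line, and the tag names an AllReduce, which suggests (but does not prove) that the task that timed out is the collective issued by the `comm_func.barrier()` call at the bottom of the traceback.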

Hello, and welcome to MindSpore. We have received your question above; please wait patiently for a reply~

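Hello, this looks like a timeout while loading HuggingFace weights online. The online HuggingFace weight conversion runs on rank 0, and the other cards wait until rank 0 has finished converting before they continue with the task. The conversion takes some time. Based on the error information you provided, it looks like rank 0 had not yet finished converting the weights, so the other cards exited the task once they had waited for the full timeout period.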
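
The waiting pattern described above can be illustrated with a small, self-contained simulation. This is an analogy built on Python's `threading.Barrier`, not MindSpore or HCCL code: threads stand in for ranks, `time.sleep` stands in for the weight conversion on rank 0, and the names `simulate`, `convert_seconds`, and `wait_timeout` are made up for the sketch.

```python
import threading
import time


def simulate(convert_seconds, wait_timeout, world_size=4):
    """Return each rank's outcome when rank 0 does extra work before a barrier."""
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        if rank == 0:
            # Stand-in for rank 0 converting the HuggingFace weights.
            time.sleep(convert_seconds)
        try:
            # Every rank waits here until all ranks have arrived.
            barrier.wait(timeout=wait_timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # A waiter timed out, so the barrier is broken for every rank.
            results[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Conversion takes longer than the waiters are willing to wait.
print(simulate(convert_seconds=0.5, wait_timeout=0.1))
# Conversion finishes well within the timeout.
print(simulate(convert_seconds=0.05, wait_timeout=2.0))
```

In the first call the waiting ranks give up before rank 0 arrives, so the whole group fails; in the second the same work succeeds simply because the waiters tolerate a longer delay. That is the reasoning behind raising the timeout rather than changing the conversion itself.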

I suggest setting an environment variable to extend the time the other cards wait before the timeout check fires:

export HCCL_CONNECT_TIMEOUT=3600
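
If the job is started from a Python wrapper or a notebook cell instead of a shell, the same variable can be placed in the environment handed to the launcher process. The sketch below assumes the variable has to be present before the worker processes start; the helper name `with_hccl_connect_timeout` is hypothetical, and the value 3600 is simply the one suggested above. HCCL also documents a separate `HCCL_EXEC_TIMEOUT` variable for waits during execution; whether it is the relevant limit for the EI0002 Notify timeout in this case is not confirmed by this thread and should be checked against the CANN documentation.

```python
import os


def with_hccl_connect_timeout(seconds, base_env=None):
    """Return a copy of the environment with HCCL_CONNECT_TIMEOUT set."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("timeout must be a positive integer number of seconds")
    # Copy so the caller's (or the current process's) environment is untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["HCCL_CONNECT_TIMEOUT"] = str(seconds)
    return env


env = with_hccl_connect_timeout(3600, base_env={})
print(env["HCCL_CONNECT_TIMEOUT"])
```

The returned dictionary can then be passed as `env=` to `subprocess.run` when starting whatever launch command the platform uses, so that every worker inherits the setting.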

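@xinghan Hello, the MindSpore support staff have analyzed the issue and explained its cause. Since we have not seen a reply from you for some time, the moderator will now accept the answer and close this thread. If you have further questions, please open a new post. Thank you for your support~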
