1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend 910
MindSpore version: mindspore==2.3.1
Execution mode (PyNative/Graph): either
Python version: Python 3.9.2
Operating system: Linux
2 Error Report
2.1 Problem Description
Training runs normally on CPU with:
context.set_context(device_target="CPU", mode=context.PYNATIVE_MODE, pynative_synchronize=True)
ms.dataset.config.set_debug_mode(True)
Switching the target to Ascend with
context.set_context(device_target="Ascend", save_graphs=False, mode=context.GRAPH_MODE)
raises the error below.
2.2 Error Message:
[WARNING] GE_ADPT(700158,ffffb4b28be0,python3):2024-12-23-16:52:11.085.773 [mindspore/ccsrc/transform/graph_ir/utils.cc:84] FindAdapter] Can't find OpAdapter for Bernoulli
[ERROR] DEVICE(700158,fffbfbfff1e0,python3):2024-12-23-16:52:43.004.511 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:257] SyncStream] Call runtime aclrtSynchronizeStreamWithTimeout error.
[ERROR] DEVICE(700158,fffbfbfff1e0,python3):2024-12-23-16:52:43.004.584 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:270] SyncAllStreams] SyncStream for stream id 0 failed.
0%| | 0/36 [01:37<?, ?it/s]
Traceback (most recent call last):
File "/data/lai/aagnet_ms_v1/seg_trainer.py", line 186, in <module>
loss = train_net(node_attr, edge_attr, node_grid, edge_grid, label, *batch_homo.get_batched_graph()).asnumpy()
File "/root/miniconda3/envs/mindspore_gl/lib/python3.9/site-packages/mindspore/nn/cell.py", line 703, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/root/miniconda3/envs/mindspore_gl/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1074, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/root/miniconda3/envs/mindspore_gl/lib/python3.9/site-packages/mindspore/common/api.py", line 1860, in __call__
return self.run(obj, *args, phase=phase)
File "/root/miniconda3/envs/mindspore_gl/lib/python3.9/site-packages/mindspore/common/api.py", line 1911, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/root/miniconda3/envs/mindspore_gl/lib/python3.9/site-packages/mindspore/common/api.py", line 185, in wrapper
results = fn(*arg, **kwargs)
File "/root/miniconda3/envs/mindspore_gl/lib/python3.9/site-packages/mindspore/common/api.py", line 1891, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Sync stream failed:Ascend_0
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E29999: Inner Error!
E29999: 2024-12-23-16:52:22.334.581 The error from device(chipId:0, dieId:0), serial number is 1. there is a dsa error, dsa channel is 2, error intr status=0x4.[FUNC:ProcessStarsDsaErrorInfo][FILE:device_error_proc.cc][LINE:1491][THREAD:700513]
TraceBack (most recent call last):
Task execute failed,device_id=0,stream_id=2,task_id=270,flip_num=0,task_type=50(STARS_COMMON).[FUNC:GetError][FILE:stream.cc][LINE:1082][THREAD:702423]
rtStreamSynchronizeWithTimeout execute failed, reason=[task exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53][THREAD:702423]
synchronize stream failed, runtime result = 507001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161][THREAD:702423]
(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:865 Run
[ERROR] DEVICE(700158,ffffb4b28be0,python3):2024-12-23-16:53:43.316.438 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:257] SyncStream] Call runtime aclrtSynchronizeStreamWithTimeout error.
[ERROR] DEVICE(700158,ffffb4b28be0,python3):2024-12-23-16:53:43.316.502 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:578] SyncStream] Sync default stream failed.
[ERROR] DEVICE(700158,ffffb4b28be0,python3):2024-12-23-16:53:43.316.533 [mindspore/ccsrc/runtime/device/kernel_runtime_manager.cc:134] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(700158,ffffb4b28be0,python3):2024-12-23-16:53:43.316.668 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:257] SyncStream] Call runtime aclrtSynchronizeStreamWithTimeout error.
[ERROR] DEVICE(700158,ffffb4b28be0,python3):2024-12-23-16:53:43.316.697 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:270] SyncAllStreams] SyncStream for stream id 0 failed.
[ERROR] ME(700158,ffffb4b28be0,python3):2024-12-23-16:53:43.316.720 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:490] WaitTaskFinishOnDevice] SyncStream failed
3 Root-Cause Analysis
The MindSpore API documentation for ops.Bernoulli lists its supported platforms as GPU and CPU only; the operator has no Ascend implementation. This matches the warning "Can't find OpAdapter for Bernoulli": the operator cannot be mapped to an Ascend kernel, and the subsequent stream synchronization fails with the E29999 device error. Either of the following configurations runs correctly:
context.set_context(device_target="CPU", mode=context.PYNATIVE_MODE, pynative_synchronize=True)
context.set_context(device_target="GPU", mode=context.PYNATIVE_MODE, pynative_synchronize=True)
4 Solution
Following the official API documentation, use the ops.Bernoulli interface only on GPU or CPU hardware environments.
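If the model must run on Ascend, a common workaround (an assumption on our part, not a recommendation from the MindSpore documentation) is to emulate Bernoulli sampling with a uniform draw followed by a comparison, using operators that are broadly available across backends. The idea is that a sample of Bernoulli(p) equals the indicator of U < p for U ~ Uniform[0, 1). The sketch below demonstrates the equivalence with NumPy for portability; in a MindSpore network the same pattern would use the framework's uniform-sampling and comparison operators in place of ops.Bernoulli.

```python
import numpy as np

def bernoulli_via_uniform(p, shape, rng):
    """Draw Bernoulli(p) samples as the indicator (U < p), U ~ Uniform[0, 1).

    This mirrors the proposed workaround: replace the Bernoulli operator
    (GPU/CPU only in MindSpore 2.3.1) with a uniform draw plus a
    comparison, which rely only on widely supported primitives.
    """
    u = rng.random(shape)              # U ~ Uniform[0, 1)
    return (u < p).astype(np.float32)  # 1.0 with probability p, else 0.0

rng = np.random.default_rng(0)
samples = bernoulli_via_uniform(0.3, (100_000,), rng)
print(samples.mean())  # empirical rate, close to p = 0.3
```

Note that this changes which random-number stream is consumed, so results will not be bit-identical to runs that used ops.Bernoulli on CPU/GPU, even with the same seed.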