1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.2.0 and 2.2.10
Execution mode (PyNative/Graph): Graph mode
Python version: Python=3.7
Operating system platform: Linux
2. Error message
MindSpore training aborts abnormally with one of: Try to send request before Open(), Try to get response before Open(), Response is empty
Sampling with EulerEDMSampler for 50 steps: 0%| | 0/50 [00:00<?, ?it/s]
Sampling with EulerEDMSampler for 50 steps: 0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/stable_diffusion_xl/demo/pangu3_infer_server_modelarts.py", line 291, in <module>
    sample(args)
  File "/home/ma-user/modelarts/user-job-dir/stable_diffusion_xl/demo/pangu3_infer_server_modelarts.py", line 274, in sample
    version_dict,
  File "/home/ma-user/modelarts/user-job-dir/stable_diffusion_xl/demo/pangu3_infer_server_modelarts.py", line 207, in run_txt2img
    samples_z = sampler(model, high_timestamp_model, randn, cond=c, uc=uc, other_c=other_c)
  File "/home/ma-user/modelarts/user-job-dir/stable_diffusion_xl/gm/modules/diffusionmodules/sampler.py", line 218, in __call__
    gamma,
  File "/home/ma-user/modelarts/user-job-dir/stable_diffusion_xl/gm/modules/diffusionmodules/sampler.py", line 189, in sampler_step
    denoised = self.denoise(x, model, sigma_hat, cond, uc, other_c)
  File "/home/ma-user/modelarts/user-job-dir/stable_diffusion_xl/gm/modules/diffusionmodules/sampler.py", line 54, in denoise
    c_skip, c_out, c_in, c_noise = model.denoiser(sigmas, noised_input.ndim)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 664, in __call__
    raise err
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/nn/cell.py", line 661, in __call__
    _pynative_executor.end_graph(self, output, *args, **kwargs)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/common/api.py", line 1304, in end_graph
    self._executor.end_graph(obj, output, *args, *(kwargs.values()))
RuntimeError: Try to send request before Open(). For more details
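3. Root cause analysis
The direct cause of this kind of error is usually that the operator-compilation subprocess has died, or that a call has blocked and timed out. Check the following:
- Check the logs for other errors that appear before this one. If there are any, resolve those first. Some operator-related problems (for example, the TBE package not being installed correctly on Ascend, or nvcc missing on GPU) cause this error as a follow-on failure.
- If graph kernel fusion is enabled, compilation of an AKG operator may have hung and timed out. Try disabling graph kernel fusion.
- On Ascend, try reducing the number of processes used for parallel operator compilation. This is set through the environment variable MS_BUILD_PROCESS_NUM, with a valid range of 1 to 24. In this case 16 is recommended: a normal Ascend machine has 192 host CPU threads, but the cloud environment imposes a limit, so fewer than 192 are actually available. The framework defaults to 24 processes, which is why the error occurred.
- Check the host's memory and CPU usage. If either is too high, the operator-compilation process may fail to start, which makes compilation fail. Try to find the processes that use too much memory or CPU and optimize them.
- If you hit this problem in a cloud training environment, try restarting the kernel.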
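The host-resource check and the MS_BUILD_PROCESS_NUM adjustment above can be sketched in Python. This is a minimal illustration and not part of MindSpore. The helper names (`usable_cpu_count`, `pick_build_process_num`, `available_memory_gib`, `configure_build_env`) are hypothetical. The rule "use 16, but never more than the usable CPU count" is an assumed heuristic built on the recommendation above. Note that `os.sched_getaffinity` only reflects CPU affinity, so a cgroup CPU quota on a cloud node may restrict the process further than this reports.

```python
import os

# Valid range for MS_BUILD_PROCESS_NUM as stated in the text above.
MIN_BUILD_PROCS, MAX_BUILD_PROCS = 1, 24


def usable_cpu_count():
    """Best-effort count of the CPUs this process is allowed to run on."""
    try:
        # Linux only: honors affinity/cpuset limits, but not cgroup CPU quotas.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return os.cpu_count() or 1


def pick_build_process_num(cpus, preferred=16):
    """Assumed heuristic: the preferred value, capped by usable CPUs and the 1..24 range."""
    return max(MIN_BUILD_PROCS, min(MAX_BUILD_PROCS, preferred, cpus))


def available_memory_gib(meminfo_path="/proc/meminfo"):
    """Read MemAvailable from /proc/meminfo (Linux); return None if it cannot be read."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    except OSError:
        pass
    return None


def configure_build_env():
    """Set MS_BUILD_PROCESS_NUM for this process and report what was found."""
    cpus = usable_cpu_count()
    num = pick_build_process_num(cpus)
    os.environ["MS_BUILD_PROCESS_NUM"] = str(num)
    return cpus, num, available_memory_gib()


if __name__ == "__main__":
    cpus, num, mem = configure_build_env()
    print(f"usable CPUs: {cpus}, MS_BUILD_PROCESS_NUM={num}, MemAvailable (GiB): {mem}")
```

The environment variable has to be in place before MindSpore starts compiling operators, so the safest option is to call `configure_build_env()` (or to `export MS_BUILD_PROCESS_NUM=16` in the launch script) before `import mindspore`. To rule out graph kernel fusion, the usual switch is the `enable_graph_kernel` option of `mindspore.set_context`. Check the documentation for your MindSpore version.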