1 System environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.2.0
Execution mode (PyNative/Graph): xxx
Python version: Python=3.9.13
OS platform: Linux
2 Error messages
During distributed cluster training, some of the cards report the following error:
[ERROR] GE_ADPT(2245,fffa1ffff1e0,python):2024-01-03-13:56:51.823.242 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:352] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
EI0006: Getting socket times out. Reason: Remote Rank did not send the data in time. Please check the reason for the rank being stuck
Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.). For details:https://www.hiascend.com/document
TraceBack (most recent call last):
Call ops_kernel_info_store LoadTask fail[FUNC:Distribute][FILE:hccl_task_info.cc][LINE:323]
Call ops_kernel_info_store unloadTask fail[FUNC:Release][FILE:hccl_task_info.cc][LINE:665]
GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0xaaab1c1e9620.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
[Run][Graph]Run graph with stream asyn failed, error code = 1343225860, session id = 0,graph id = 2, stream = 0xaaab1c1e9620.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:826]
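One of the suggestions in the error text above is to raise HCCL_CONNECT_TIMEOUT when every NPU reports this timeout and the errors are spread over more than the default 120 s window. If that really is the issue, the variable has to be set before communication init; a minimal sketch (600 s is only an illustrative value, and exporting the variable in the launch script works just as well):

```python
import os
import mindspore as ms
from mindspore.communication import init

# Raise the HCCL connect timeout from the default 120s; 600 is only an illustrative value.
# The variable must be set before init() so that HCCL picks it up.
os.environ["HCCL_CONNECT_TIMEOUT"] = "600"

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
init()
```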
The error message above already lists several troubleshooting directions, but most likely one of the cards has gone down, which makes the communication setup fail. So we need to find the error message of the card that went down first. In this run the earliest error was reported by device-5, as follows:
[ERROR] GE_ADPT(2239,fffaa4ff91e0,python):2024-01-03-12:57:12.973.208 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:352] RunGraphWithStreamAsync] Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/train/model.py", line 623, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback,
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/nn/cell.py", line 680, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1023, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1589, in __call__
return self.run(obj, *args, phase=phase)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1628, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 121, in wrapper
results = fn(*arg, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1608, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Exec graph failed
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E19999: Inner Error!
E19999 Call ops_kernel_info_store LoadTask fail[FUNC:Distribute][FILE:hccl_task_info.cc][LINE:323]
TraceBack (most recent call last):
Call ops_kernel_info_store unloadTask fail[FUNC:Release][FILE:hccl_task_info.cc][LINE:665]
GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 2, stream = 0xaaaad7c58570.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:517]
[Run][Graph]Run graph with stream asyn failed, error code = 1343225860, session id = 0,graph id = 2, stream = 0xaaaad7c58570.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:826]
Comparing the two error messages, we can see that the earliest failure occurred at 2024-01-03-12:57:12.973.208; note down this timestamp.
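Instead of comparing every rank's log by hand, a small helper can print the first ERROR timestamp of each plog file so that the earliest-failing rank stands out. A minimal sketch, assuming the host plogs were collected under ./plog (adjust the glob pattern to your actual log layout):

```python
import glob
import re

# Timestamp format in the logs: 2024-01-03-12:57:09.479.063 (sorts correctly as a string).
ts_pattern = re.compile(r"\[ERROR\].*?(\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}\.\d{3})")

first_errors = {}
for path in glob.glob("plog/plog-*.log"):  # assumed collection path
    with open(path, errors="ignore") as f:
        for line in f:
            m = ts_pattern.search(line)
            if m:
                first_errors[path] = m.group(1)
                break  # only the first ERROR per file matters

# The earliest first-error points at the rank that most likely went down first.
for path, ts in sorted(first_errors.items(), key=lambda kv: kv[1]):
    print(ts, path)
```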
2.1 Problem description
In this workload, training runs normally on 2 nodes and on 4 nodes, but on 8 nodes the job hangs and cannot be launched, reporting the following error:
call GE RunGraphWithStreamAsync Failed, ret is: 4294967295
This error alone provides no useful information; it has to be analyzed together with the plog and device logs.
3 Root cause analysis
As described above, we need to dig further with the plog and device logs. Search the log directory for the 2024-01-03-12:57* time point:
Command: grep -rani ERROR . | grep "2024-01-03-12:57"
plog/plog-2239_20240103120735388.log:789:[ERROR] HCCP(2239,python):2024-01-03-12:57:09.479.063 [ra_hdc.c:242]tid:103358,hdc_send_recv_pkt_recv(242) : [recv][hdc_send_recv_pkt]HDC recv msg err ret(25)
plog/plog-2239_20240103120735388.log:790:[ERROR] HCCP(2239,python):2024-01-03-12:57:09.479.094 [ra_hdc.c:308]tid:103358,hdc_send_recv_pkt(308) : [send_recv][pkt]HDC pkt recv err ret(25) phy_id(0)
plog/plog-2239_20240103120735388.log:791:[ERROR] HCCP(2239,python):2024-01-03-12:57:09.479.109 [ra_hdc.c:377]tid:103358,ra_hdc_process_msg(377) : [process][ra_hdc_msg]hdc_send_recv_pkt opcode(22) failed ret(25) phy_id(0)
plog/plog-2239_20240103120735388.log:792:[ERROR] HCCP(2239,python):2024-01-03-12:57:09.479.118 [ra_hdc.c:1713]tid:103358,ra_hdc_socket_white_list_op(1713) : [op][ra_hdc_socket_white_list]ra hdc process msg failed ret(25) phy_id(0)
plog/plog-2239_20240103120735388.log:793:[ERROR] HCCL(2239,python):2024-01-03-12:57:09.479.131 [adapter_hccp.cc:910][103358][Del][RaSocketWhiteList]errNo[0x000000000500000b] ra white list del fail, return[328207].
plog/plog-2239_20240103120735388.log:794:[ERROR] HCCL(2239,python):2024-01-03-12:57:09.479.138 [comm_base.cc:122][103358]call trace: hcclRet -> 11
plog/plog-2239_20240103120735388.log:795:[ERROR] HCCL(2239,python):2024-01-03-12:57:09.479.145 [comm_base.cc:108][103358]call trace: hcclRet -> 11
plog/plog-2239_20240103120735388.log:796:[ERROR] HCCL(2239,python):2024-01-03-12:57:09.479.150 [comm_base.cc:217][103358]call trace: hcclRet -> 11
plog/plog-2239_20240103120735388.log:797:[ERROR] HCCL(2239,python):2024-01-03-12:57:09.479.156 [comm_base.cc:82][103358]call trace: hcclRet -> 11
plog/plog-2239_20240103120735388.log:798:[ERROR] HCCL(2239,python):2024-01-03-12:57:09.479.162 [comm_factory.cc:776][103358][create][CommHD]comm halving doubling array[0] init failed
................... (there are many more log lines; only part of the output is shown)
From the logs above, the first error comes from the HCCP component: **[ra_hdc.c:242]tid:103358,hdc_send_recv_pkt_recv(242) : [recv][hdc_send_recv_pkt]HDC recv msg err ret(25)**
This kind of error usually means the device side hit an exception and released the HDC connection first (the messages log contains the corresponding INFO-level entry from the HDC component: Server destroy success), so when the HCCP host side issues an opcode over the HDC channel it fails with error 25 (session has been closed, i.e., the peer device session was closed).
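To confirm this on the device side, look for the HDC "Server destroy success" INFO entry mentioned above; a minimal sketch, assuming the system log is at /var/log/messages (the path may differ depending on the OS):

```python
from pathlib import Path

# /var/log/messages is an assumption; adjust to where your system log lives.
log = Path("/var/log/messages")
if log.exists():
    for line in log.read_text(errors="ignore").splitlines():
        if "Server destroy success" in line:
            print(line)
else:
    print(f"{log} not found; check your system log location")
```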
4 Root cause
The device hit an OOM and released its resources early, so the HCCP host side got error 25 when issuing an opcode over HDC (session has been closed, i.e., the peer device session was closed). The OOM occurred because the system could not allocate enough device memory: the training script sets max_device_memory, the maximum device memory available to the framework, and this value was set so large that the system could no longer obtain device memory, bringing the card down.
The error does not show up on 2 or 4 nodes because their data-parallel degree (dp) is 2 and 4 respectively, i.e., the global batch size is 2 and 4, whereas with 8 nodes the global batch size is 8, which requires more device memory.
5 Solution
Try reducing max_device_memory to reserve enough device memory for the system.
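In MindSpore this cap is set through the max_device_memory option of set_context; a minimal sketch, where "26GB" is only an illustrative value and should be tuned to the actual card so that HCCL and other system components keep enough headroom:

```python
import mindspore as ms

# Cap the memory pool the framework may occupy on each device; the remainder stays
# available to the system (HCCL buffers, runtime, etc.). "26GB" is an illustrative value.
ms.set_context(max_device_memory="26GB")
```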