1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2
Execution mode (PyNative/Graph): either
2. Error information
2.1 Problem description
Running the wizardcoder model with MindSpore under dp:mp:pp = 1:2:4, training works with num_layer=16, but the following error is reported when num_layer is set to 20.
2.2 Error message
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. Some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster. 2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable. 3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.) For details: https://www.hiascend.com/document
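The second suggestion in the message refers to the HCCL_CONNECT_TIMEOUT environment variable. If a slow start-up were really the cause, the timeout could be raised before launching each training process, for example (the value is in seconds; 600 is just an illustrative choice):

```bash
# Raise the HCCL connect timeout from the default 120 s to 600 s.
# Export this in every shell that launches a training process.
export HCCL_CONNECT_TIMEOUT=600
```

In this case, however, the timeout is not the root cause, as the analysis in the next section shows.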
3. Solution
Therefore, for multi-node parallel training in this case, only the offline slicing approach can be used. The concrete troubleshooting steps are as follows:
- First find the card that does not report "EI0006: getting socket time out"; the error reported by that card is the real root cause. In this case rank1 does not report EI0006 while rank4 does (a one-line scan over all rank logs is sketched after the log excerpts below).
rank1 does not report the error:
[root@localhost research]# grep -rna "EI0006" /home/wizardcoder/1_wizardcoder-mindformers/research/output/log/rank_1/mindformer.log
6266: [WARNING] GE (1253271,python):2023-09-08-16:36:48.256.311 [error_manager.cc:510]1253271 ParseJsonFile: [Check][Config]There are the same error code EI0006 in /usr/local/c30_0818/CANN-7.0/aarch64-linux/lib64/../conf/error_manager/error_code.json
rank4 reports the error:
[root@localhost research]# grep -rna "EI0006" /home/wizardcoder/1_wizardcoder-mindformers/research/output/log/rank_4/mindformer.log
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. Some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster. 2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable. 3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.) For details: https://www.hiascend.com/document
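To check every rank at once instead of grepping the logs one by one, a quick scan such as the following can be used (a sketch; it assumes all per-rank logs live under output/log/rank_*/mindformer.log, as in the paths above):

```bash
# Count EI0006 occurrences in every rank's log.
# Ranks that report 0 hits are the ones whose own error message points at the real root cause.
for f in /home/wizardcoder/1_wizardcoder-mindformers/research/output/log/rank_*/mindformer.log; do
    echo "$f: $(grep -c "EI0006" "$f") hit(s)"
done
```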
- Looking at rank1's mindformer.log, the failure turns out to be caused by insufficient device memory; this is the root cause of the problem.
EL0004: Failed to allocate memory.
Possible Cause: Available memory is insufficient
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory] [FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
Call rtMalloc fail, purpose:feature map, used for op input and output, size:54966232576, device_id:0 [FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:74]
Malloc Memory fail, purpose:feature map, used for op input and output, memory_key:0_f, memory_size:54966232576, device_id:0 [FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:128]
Malloc featuremap memory fail, malloced_memory_size[0] < mem_size[54966232576], device_id[0] [FUNC:MallocFeatureMapMem][FILE:davinci_model.cc][LINE:5294]
MallocFeatureMapMem fail, data_size:54966232576, model_id:3, check invalid [FUNC:InitFeatureMapAndP2PMem][FILE:davinci_model.cc][LINE:383]
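The failed allocation is a feature-map request of 54966232576 bytes (about 51 GB), so the device genuinely runs out of memory once num_layer is raised to 20; the EI0006 timeouts on the other ranks are only a secondary symptom. As a quick sanity check of device memory pressure, HBM usage on the Ascend node can be inspected with the driver's npu-smi tool (assuming it is installed):

```bash
# Print per-device memory (HBM) usage and health status for all NPUs on this node.
npu-smi info
```

Once the out-of-memory root cause is confirmed, the workaround is the offline slicing approach mentioned at the beginning of this section.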