1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2
Execution mode (PyNative/Graph): either
2. Error information
2.1 Problem description
Running the wizardcoder model with MindSpore under dp:mp:pp = 1:2:4, training works with num_layer=16, but the following error is reported when num_layer is set to 20.
2.2 Error message
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. Some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster. 2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable. 3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.) For details: https://www.hiascend.com/document
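The second suggestion in the message refers to the HCCL_CONNECT_TIMEOUT environment variable. If a slow start-up were really the cause, the timeout could be raised before launching each training process, for example (the value is in seconds; 600 is just an illustrative choice):

```bash
# Raise the HCCL connect timeout from the default 120 s to 600 s.
# Export this in every shell that launches a training process.
export HCCL_CONNECT_TIMEOUT=600
```

In this case, however, the timeout is not the root cause, as the analysis in the next section shows.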
3. Solution
Therefore, for multi-node parallel training in this case, only the offline slicing approach can be used. The concrete troubleshooting steps are as follows:
- First find the card that does not report "EI0006: getting socket time out"; the error reported by that card is the real root cause. In this case rank1 does not report EI0006 while rank4 does (a one-line scan over all rank logs is sketched after the log excerpts below).
rank1 does not report the error:
[root@localhost research]# grep -rna "EI0006" /home/wizardcoder/1_wizardcoder-mindformers/research/output/log/rank_1/mindformer.log
6266: [WARNING] GE (1253271,python):2023-09-08-16:36:48.256.311 [error_manager.cc:510]1253271 ParseJsonFile: [Check][Config]There are the same error code EI0006 in /usr/local/c30_0818/CANN-7.0/aarch64-linux/lib64/../conf/error_manager/error_code.json
rank4 reports the error:
[root@localhost research]# grep -rna "EI0006" /home/wizardcoder/1_wizardcoder-mindformers/research/output/log/rank_4/mindformer.log
EI0006: Getting socket times out. Reason: 1. The remote does not initiate a connect request. Some NPUs in the cluster are abnormal. 2. The remote does not initiate a connect request because the collective communication operator is started too late or is not started by some NPU in the cluster. 3. The communication link is disconnected. (For example, the IP addresses are not on the same network segment or the TLS configurations are inconsistent.)
Solution: 1. Check the rank service processes with other errors or no errors in the cluster. 2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable. 3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.) For details: https://www.hiascend.com/document
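To check every rank at once instead of grepping the logs one by one, a quick scan such as the following can be used (a sketch; it assumes all per-rank logs live under output/log/rank_*/mindformer.log, as in the paths above):

```bash
# Count EI0006 occurrences in every rank's log.
# Ranks that report 0 hits are the ones whose own error message points at the real root cause.
for f in /home/wizardcoder/1_wizardcoder-mindformers/research/output/log/rank_*/mindformer.log; do
    echo "$f: $(grep -c "EI0006" "$f") hit(s)"
done
```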
- Looking at rank1's mindformer.log, the failure turns out to be caused by insufficient device memory; this is the root cause of the problem.
EL0004: Failed to allocate memory.
Possible Cause: Available memory is insufficient
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory] [FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
Call rtMalloc fail, purpose:feature map, used for op input and output, size:54966232576, device_id:0 [FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:74]
Malloc Memory fail, purpose:feature map, used for op input and output, memory_key:0_f, memory_size:54966232576, device_id:0 [FUNC:MallocMemory][FILE:graph_mem_allocator.cc][LINE:128]
Malloc featuremap memory fail, malloced_memory_size[0] < mem_size[54966232576], device_id[0] [FUNC:MallocFeatureMapMem][FILE:davinci_model.cc][LINE:5294]
MallocFeatureMapMem fail, data_size:54966232576, model_id:3, check invalid [FUNC:InitFeatureMapAndP2PMem][FILE:davinci_model.cc][LINE:383]
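The failed allocation is a feature-map request of 54966232576 bytes (about 51 GB), so the device genuinely runs out of memory once num_layer is raised to 20; the EI0006 timeouts on the other ranks are only a secondary symptom. As a quick sanity check of device memory pressure, HBM usage on the Ascend node can be inspected with the driver's npu-smi tool (assuming it is installed):

```bash
# Print per-device memory (HBM) usage and health status for all NPUs on this node.
npu-smi info
```

Once the out-of-memory root cause is confirmed, the workaround is the offline slicing approach mentioned at the beginning of this section.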