MindSpore model parallelism fails with ValueError: array split does not result in an equal division

1. System environment

Hardware environment (Ascend/GPU/CPU): Ascend
Execution mode: static graph
Python version: 3.7
Operating system platform: Linux

2. Error message

2.1 Problem description

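Running model parallelism in MindSpore works with a randomly initialized model, but fails with a NumPy error once the pretrained model is loaded. The call stack shows that the NumPy call is made from the parameter-slicing code. The error: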

ValueError: array split does not result in an equal division  

Full traceback:

Traceback (most recent call last):  
  File "/opt/huawei/schedule-train/algorithm/*/main.py", line 701, in <module>  
    main(config_)  
  File "/opt/huawei/schedule-train/algorithm/*/main.py", line 673, in main  
    train_prompt, train_actor_logprobs, train_sft_logprobs, train_critic_r, train_reward_r)  
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 636, in __call__  
    out = self.compile_and_run(*args, **kwargs)  
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 959, in compile_and_run  
    self.compile(*args, **kwargs)  
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 936, in compile  
    jit_config_dict=self._jit_config_dict, *args, **kwargs)  
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1385, in compile  
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())  
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/parallel/_utils.py", line 105, in _slice_parameter  
    new_tensor = _load_tensor_by_layout(parameter, layout)  
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/parallel/_tensor.py", line 261, in _load_tensor_by_layout  
    tensor_slice = np.split(tensor_slice, size)[rank]  
  File "<__array_function__ internals>", line 6, in split  
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/numpy/lib/shape_base.py", line 873, in split  
    'array split does not result in an equal division') from None  
ValueError: array split does not result in an equal division

3. Root cause analysis

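Before the analysis, note that under model parallelism a sliced checkpoint file can only be loaded after the network has been compiled. The following pseudocode shows the scenario in which the error occurred: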

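import numpy as np  
import mindspore  
from mindspore import Tensor  
    
# Pseudocode: TrainOneStepWithLossScale, with_loss, args, rank,  
# local_ckpt_path and create_data are defined elsewhere.  
net_with_grade = TrainOneStepWithLossScale(with_loss, **args)  
# Dummy inputs, used only to compile the network before loading.  
token_idx = Tensor(np.random.random((8, 4096)).astype(np.int32))  
atten_mask = Tensor(np.random.sample((8, 4096, 4096)).astype(np.float32))  
net_with_grade.compile(token_idx, atten_mask)  # first compilation  
    
# Load this rank's slice of the checkpoint after compiling.  
ckpt_name = f"node_{int(rank/8)}/saved_{rank}_large.ckpt"  
param_dict = mindspore.load_checkpoint(local_ckpt_path)  
train_not_load = mindspore.load_param_into_net(with_loss, param_dict)  
    
dataloader = create_data(**args)  
for data in dataloader.create_dict_iterator():  
    train_token_idx = data['token_idx']  
    train_atten_mask = data['atten_mask']  
    # Called with real data from the dataset.  
    loss = net_with_grade(train_token_idx, train_atten_mask)  
    print(loss)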

As described above, the error occurs after the pretrained model is loaded. We printed the shape information at the point of failure:

_load_tensor_by_layout(parameter, layout): Parameter (name=backbone.backbone.blocks.0.attention.projection.weight, shape=(80, 5120), dtype=Float16, requires_grad=True) ([8, 8], [0, -1], [80, 5120], 0, True, '8-682084589588865485')  
    (10, 5120) 8ba

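According to the error, np.split fails because the shape of its input is not evenly divisible by size. The original shape of this Parameter is (640, 5120) and mp (the model-parallel degree) is 8, so after compilation the parameter shape is (80, 5120), as printed above. At the time of the error, however, the tensor is (10, 5120), which means it has clearly been sliced a second time, whereas only one slicing is expected. In other words, compile in the code above performs one compilation, and the call net_with_grade(train_token_idx, train_atten_mask) inside the for loop triggers another compilation, which slices the parameter again.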
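The shape arithmetic can be reproduced with NumPy alone. The sketch below is a simplified illustration, not MindSpore's actual slicing code: take_slice is a hypothetical helper that mirrors only the np.split(tensor_slice, size)[rank] line from the traceback, and rank 0 is chosen arbitrarily.

```python
import numpy as np

MP = 8  # model-parallel degree from the post


def take_slice(tensor, size, rank):
    """Hypothetical helper: split along axis 0 into `size` equal parts, keep one."""
    return np.split(tensor, size)[rank]


full = np.zeros((640, 5120), dtype=np.float16)  # original parameter shape
first = take_slice(full, MP, rank=0)    # the single split that is expected
second = take_slice(first, MP, rank=0)  # the unwanted repeated split
print(first.shape, second.shape)        # (80, 5120) (10, 5120)

# 10 rows cannot be divided into 8 equal parts, so np.split raises.
try:
    take_slice(second, MP, rank=0)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Seeing a slice whose leading dimension is the original size divided by mp more than once is therefore a strong hint that slicing ran more often than intended.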

4. Solution

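To solve this kind of problem, first eliminate the second compilation. In general, the graph is not recompiled as long as the input shapes and data types stay the same, so when recompilation does occur, check whether the input shapes or dtypes differ. In this case the cause was a dtype mismatch between atten_mask at compile time and train_atten_mask during training: fp32 at compile time, fp16 during training. Changing the second compile input to fp16 fixes it: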

atten_mask = Tensor(np.random.sample((8, 4096, 4096)).astype(np.float16))
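The shape and dtype check described above can be automated before training starts. The following is a minimal sketch, assuming each input exposes .shape and .dtype; find_input_mismatches is a hypothetical helper, not a MindSpore API, and small NumPy arrays stand in for the real tensors so that the example runs without an Ascend device.

```python
import numpy as np


def find_input_mismatches(compile_inputs, train_inputs):
    """Hypothetical helper: list (index, field, compile_value, train_value) differences."""
    mismatches = []
    for i, (c, t) in enumerate(zip(compile_inputs, train_inputs)):
        if tuple(c.shape) != tuple(t.shape):
            mismatches.append((i, "shape", tuple(c.shape), tuple(t.shape)))
        if c.dtype != t.dtype:
            mismatches.append((i, "dtype", str(c.dtype), str(t.dtype)))
    return mismatches


# Small stand-ins for (token_idx, atten_mask); the real shapes are much larger.
compile_inputs = [np.zeros((2, 16), np.int32), np.zeros((2, 16, 16), np.float32)]
train_inputs = [np.zeros((2, 16), np.int32), np.zeros((2, 16, 16), np.float16)]

mismatches = find_input_mismatches(compile_inputs, train_inputs)
print(mismatches)  # [(1, 'dtype', 'float32', 'float16')]

# After the fix, the compile-time mask is fp16 as well and nothing is reported.
fixed_inputs = [compile_inputs[0], compile_inputs[1].astype(np.float16)]
fixed_mismatches = find_input_mismatches(fixed_inputs, train_inputs)
print(fixed_mismatches)  # []
```

Running such a check on the first batch from the dataset and the dummy inputs passed to compile points directly at the argument that differs.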