MindSpore distributed run fails with TypeError: The parameters number of the function is 636, but the number of provided arguments is 635.

1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.2.0 & 2.2.10
Execution mode (PyNative/Graph): Graph mode
Python version: Python=3.7
OS platform: Linux

2 Error Message

Traceback (most recent call last):  
  File "multi_gpu_infer.py", line 346, in <module>  
    eval_model(args)  
  File "multi_gpu_infer.py", line 260, in eval_model  
    warm_up_model.infer_predict_layout(ms.Tensor(np.ones(shape=(1, 3, 336,336)), ms.float16))  
  File "/root/miniconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 1882, in infer_predict_layout  
    predict_net.compile(*predict_data)  
  File "/root/miniconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 998, in compile  
    jit_config_dict=self._jit_config_dict, *args, **kwargs)  
  File "/root/miniconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1547, in compile  
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())  
TypeError: The parameters number of the function is 636, but the number of provided arguments is 635.  
FunctionGraph : mindformers_models_blip2_blip2_llama_Blip2ImageToTextGeneration_construct_1  
NodeInfo: In file /disk1/fail/scripts/mf_parallel0/mindformers/models/blip2/blip2_llama.py:536  
    def construct(self, image: ms.Tensor, text_input_ids: ms.Tensor):  
    ^  
----------------------------------------------------  
- C++ Call Stack: (For framework developers)  
----------------------------------------------------  
mindspore/ccsrc/pipeline/jit/ps/static_analysis/evaluator.cc:486 Eval  
----------------------------------------------------  
- The Traceback of Net Construct Code:  
----------------------------------------------------  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/blip2/blip2_llama.py:536  
    def construct(self, image: ms.Tensor, text_input_ids: ms.Tensor):  
    ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/blip2/blip2_llama.py:544  
        projected_qformer_output = self.forward_qformer_and_proj(image)  
                                   ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/blip2/blip2_llama.py:484  
    def forward_qformer_and_proj(self, image: ms.Tensor):  
    ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/blip2/blip2_llama.py:486  
        image_embeds = self.visual_encoder(image)  
                       ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/blip2/blip2.py:112  
    def construct(self, image):  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/blip2/blip2.py:113  
        return self.construct_without_pool(image)  
               ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit.py:163  
    def construct_without_pool(self, image, mask=None):  
    ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit.py:174  
        for block in self.blocks:  
                     ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit_modules.py:342  
    def construct(self, x, input_mask, rel_pos_bias=None):  
    ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit_modules.py:372  
        mlp_logit = self.output(output_x)  
                    ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit_modules.py:452  
    def construct(self, x):  
    ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit_modules.py:456  
        hidden = self.mapping(x)  
                 ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit_modules.py:198  
    def construct(self, x):  
    ^  
# In file /disk1/fail/scripts/mf_parallel0/mindformers/models/vit/vit_modules.py:201  
        x = P.Reshape()(x, (-1, self.in_channels))

2.1 Problem Description

When running distributed inference, the call to infer_predict_layout fails with a parameter-count mismatch:

TypeError: The parameters number of the function is 636, but the number of provided arguments is 635.

2.2 Script Code

# Invocation
# shard model and load sharded ckpt
warm_up_model = Model(model)
warm_up_model.infer_predict_layout(ms.Tensor(np.ones(shape=(1, 3, 336, 336)), ms.float32))

# Definition
class Blip2ImageToTextGeneration(Blip2Llama):
    """
    Blip2ImageToTextGeneration relies on Blip2Llama and is used for image-to-text generation.
    Args:
        config (Blip2Config): The config of Blip2ImageToTextGeneration.
    Examples:
        >>> from mindformers import Blip2ImageToTextGeneration
        >>> model = Blip2ImageToTextGeneration.from_pretrained('itt_blip2_stage2_vit_g_llama_7b')
        >>> type(model)
        <class 'mindformers.models.blip2.blip2_llama.Blip2ImageToTextGeneration'>
    """
    _support_list = MindFormerBook.get_model_support_list()['itt']['blip2']

    def __init__(self, config: Blip2Config, **kwargs):
        super(Blip2ImageToTextGeneration, self).__init__(config, **kwargs)
        self.llama_model.set_train(False)
        self.one_prefix = ops.Ones()
        self.expand_dims = P.ExpandDims()
        self.query_length = self.config.qformer_config.query_length

    def construct(self, image: ms.Tensor, text_input_ids: ms.Tensor):
        if len(text_input_ids.shape) == 1:
            text_input_ids = self.expand_dims(text_input_ids, 0)
        ............

3 Root Cause Analysis

The error tells us that the function expects 636 parameters but only 635 were provided, so one is missing. The counts are this large because in graph mode the compiled graph's parameter list includes the network's weight parameters in addition to the construct inputs; a difference of exactly one therefore points to one missing construct input. Looking at the code above, the call site passes a single input (image), while construct is defined with two inputs (image and text_input_ids), which is why the parameter-count mismatch is reported.
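
A quick way to confirm this kind of mismatch before compiling is to compare the inputs declared by construct with the tensors you are about to pass. The sketch below is illustrative only; it assumes model is the Blip2ImageToTextGeneration instance that gets wrapped by Model in the script above.

import inspect
import numpy as np
import mindspore as ms

# Inputs declared by construct (self is excluded for a bound method);
# for Blip2ImageToTextGeneration this is ['image', 'text_input_ids'].
declared = list(inspect.signature(model.construct).parameters)
print("construct expects:", declared)

# Inputs we were about to pass to infer_predict_layout -- only one, which exposes the mismatch.
provided = (ms.Tensor(np.ones(shape=(1, 3, 336, 336)), ms.float32),)
print("provided", len(provided), "tensor(s) for", len(declared), "declared input(s)")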

4 Solution

Modify the call as follows:

# shard model and load sharded ckpt  
warm_up_model = Model(model)  
warm_up_model.infer_predict_layout(ms.Tensor(np.ones(shape=(1, 3, 336, 336)), ms.float32),
                                   ms.Tensor(np.ones(shape=(1, config.seq_length)), ms.int32))
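
For reference, the surrounding distributed-inference flow typically looks like the rough sketch below, assuming the (semi-)auto-parallel context is already configured. The names ckpt_file_list and config.seq_length are placeholders and the model construction is omitted; this is a sketch, not a verified end-to-end recipe.

import numpy as np
import mindspore as ms
from mindspore.train import Model

# model = Blip2ImageToTextGeneration(config)  # built and sharded as in the script above
warm_up_model = Model(model)

# One dummy tensor per construct input: image and text_input_ids.
image = ms.Tensor(np.ones(shape=(1, 3, 336, 336)), ms.float32)
text_input_ids = ms.Tensor(np.ones(shape=(1, config.seq_length)), ms.int32)

# infer_predict_layout compiles the graph and returns the parameter layout;
# it requires semi_auto_parallel or auto_parallel mode.
layout = warm_up_model.infer_predict_layout(image, text_input_ids)

# The inferred layout is then used to load the sharded checkpoint on each rank.
# ckpt_file_list: list of sharded checkpoint file paths (placeholder).
ms.load_distributed_checkpoint(model, ckpt_file_list, predict_strategy=layout)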

The call stack under this error is not particularly helpful. When a parameter-count mismatch like this appears, the first thing to check is whether the inputs passed to the network match the inputs declared in its definition.