1. Error Description
1.1 System Environment
Hardware Environment (Ascend/GPU/CPU): Ascend
Software Environment:
-- MindSpore version (source or binary): 1.6.0
-- Python version (e.g., Python 3.7.5): 3.7.6
-- OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
-- GCC/Compiler version (if compiled from source):
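1.2 Basic Information
1.2.1 Script
The training script builds a single-operator network around MatMul and computes the matrix product of two input tensors. The script is shown below; the listing omits its import statements, and its line numbers correspond to those cited in the error log: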
01 class Net(nn.Cell):
02     def __init__(self):
03         super(Net, self).__init__()
04
05     def construct(self, x1, x2):
06         out = ops.matmul (x1, x2)
07         return out
08
09 net = Net()
10 x1 = Tensor(np.arange(32*12800).reshape(32, 12800), mindspore.float32)
11 x2 = Tensor(np.arange(10*1280).reshape(10, 1280), mindspore.float32)
12 out = net(x1, x2)
13 print('out',out.shape)
1.2.2 Error
The error message is as follows:
The function call stack (See file '/rank_0/om/analyze_fail.dat' for more details):
# 0 In file demo.py(06)
out = ops.matmul (x1, x2)
^
# 1 In file /lib/python3.7/site-packages/mindspore/ops/composite/math_ops.py(810)
if not _check_same_type(dtype1, dtype2):
^
# 2 In file /lib/python3.7/site-packages/mindspore/ops/composite/math_ops.py(824)
if F.rank(x2) == 2:
# 3 In file /lib/python3.7/site-packages/mindspore/ops/composite/math_ops.py(825)
if F.rank(x1) > 2:
# 4 In file /lib/python3.7/site-packages/mindspore/ops/composite/math_ops.py(827)
res = P.MatMul(False, transpose_b)(x1, x2)
^
Traceback (most recent call last):
File "demo.py", line 12, in <module>
out = net(x1, x2)
File "/lib/python3.7/site-packages/mindspore/nn/cell.py", line 586, in __call__
out = self.compile_and_run(*args)
File "/lib/python3.7/site-packages/mindspore/nn/cell.py", line 964, in compile_and_run
self.compile(*inputs)
File "/lib/python3.7/site-packages/mindspore/nn/cell.py", line 937, in compile
_cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
File "/lib/python3.7/site-packages/mindspore/common/api.py", line 1040, in compile
result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
File "/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 467, in __check__
fn(*(x[track] for x in args))
File "/lib/python3.7/site-packages/mindspore/ops/operations/math_ops.py", line 1430, in check_shape
raise ValueError(f"For '{cls_name}', the input dimensions must be equal, but got 'x1_col': {x1_col} "
ValueError: For 'MatMul', the input dimensions must be equal, but got 'x1_col': 12800 and 'x2_row': 10. And 'x' shape [32, 12800](transpose_a=False), 'y' shape [10, 1280](transpose_b=False).
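2. Cause Analysis
Looking at the error message, the ValueError states: the input dimensions must be equal, but got 'x1_col': 12800 and 'x2_row': 10. According to the official documentation for the matmul operator, the last dimension of x1 must equal the second-to-last dimension of x2. The inputs x1 and x2 have 12800 and 10 in those positions, so they do not meet the requirement.
When a shape mismatch like this occurs, the error log usually also prints the call stack of the graph structure, so you can first try to locate the failing position from that stack. If the call stack is not clear enough, the analyze_fail.dat file can be consulted for further analysis.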
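The shape rule described above can be sketched as a small pre-check. This is a minimal illustration in plain Python, not MindSpore's own implementation: the helper name `matmul_output_shape` is hypothetical, and it only covers 2-D inputs without transposition.

```python
def matmul_output_shape(shape1, shape2):
    """Return the result shape of a 2-D matrix product, or raise ValueError.

    Hypothetical helper for illustration; it is not part of MindSpore.
    """
    if len(shape1) != 2 or len(shape2) != 2:
        raise ValueError("this sketch only handles 2-D shapes")
    x1_col = shape1[-1]  # last dimension of x1
    x2_row = shape2[-2]  # second-to-last dimension of x2
    if x1_col != x2_row:
        raise ValueError(
            f"inner dimensions differ: x1_col={x1_col}, x2_row={x2_row}"
        )
    return (shape1[0], shape2[1])


# The shapes from the failing script are rejected by the check.
try:
    matmul_output_shape((32, 12800), (10, 1280))
except ValueError as err:
    print("mismatch:", err)
```

Running a check like this on the shapes before building the tensors makes the mismatch visible without going through graph compilation.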
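3. Solution
Based on the cause identified above, the fix is to modify the inputs so that the last dimension of x1 equals the second-to-last dimension of x2.
After this change the script runs successfully and prints:
out (32, 1280)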
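The modified script itself is not reproduced in this post. As a sketch only, two changes that satisfy the shape rule and give the reported (32, 1280) result are shown here, with NumPy standing in for MindSpore. That substitution is an assumption made so the snippet runs without an Ascend device; `np.matmul` applies the same inner-dimension rule to 2-D inputs. Which of the two options the original fix used is not known.

```python
import numpy as np

# Option A (assumed): shrink x1 so its last dimension matches x2's 10 rows.
x1_a = np.arange(32 * 10, dtype=np.float32).reshape(32, 10)
x2_a = np.arange(10 * 1280, dtype=np.float32).reshape(10, 1280)
out_a = np.matmul(x1_a, x2_a)

# Option B (assumed): give x2 12800 rows to match x1's last dimension.
x1_b = np.arange(32 * 12800, dtype=np.float32).reshape(32, 12800)
x2_b = np.ones((12800, 1280), dtype=np.float32)
out_b = np.matmul(x1_b, x2_b)

print("out", out_a.shape)
print("out", out_b.shape)
```

In the MindSpore script, the equivalent change would be made to the shapes passed to Tensor(...) on lines 10 and 11.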
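4. Summary
Steps for locating the problem:
1. Find the line of user code that raised the error: out = ops.matmul (x1, x2);
2. Use the keywords in the error log to narrow the scope of the analysis: For 'MatMul', the input dimensions must be equal, but got 'x1_col': 12800 and 'x2_row': 10;
3. Check against the official API documentation whether the arguments passed to the API meet its requirements; the call stack and the analyze_fail.dat file can also be used together for the analysis;
4. Pay particular attention to whether variables are defined and initialized correctly.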
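Step 2 can be illustrated by pulling the key values out of the final log line programmatically. This is a rough sketch: `parse_matmul_error` is a hypothetical helper, and the pattern is written against the one message quoted in section 1.2.2, so it may not match the wording used by other MindSpore versions.

```python
import re

# Final line of the error log from section 1.2.2.
message = (
    "ValueError: For 'MatMul', the input dimensions must be equal, "
    "but got 'x1_col': 12800 and 'x2_row': 10. And 'x' shape [32, 12800]"
    "(transpose_a=False), 'y' shape [10, 1280](transpose_b=False)."
)


def parse_matmul_error(text):
    """Extract the operator name and the two mismatched sizes.

    Hypothetical helper; returns None when the text does not match.
    """
    pattern = r"For '(\w+)'.*'x1_col': (\d+) and 'x2_row': (\d+)"
    match = re.search(pattern, text)
    if match is None:
        return None
    op_name, x1_col, x2_row = match.groups()
    return op_name, int(x1_col), int(x2_row)


print(parse_matmul_error(message))
```

The extracted operator name and sizes point directly at the API documentation to read and the two dimensions to compare.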

