Precision mismatch when migrating a model from torch to MindSpore

chengxiaoli · August 4, 2025, 02:09

1 System environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2.14
Execution mode (PyNative/Graph): PyNative
Python version: Python=3.10
Operating system platform: Ubuntu18.04

2 Error information

2.1 Script information

torch code:

import torch
from torch import nn

class GELUConvBlock(nn.Module):
    def __init__(
        self, in_ch, out_ch, group_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, 1, 1)
        self.group_norm = nn.GroupNorm(group_size, out_ch)
        self.gelu = nn.GELU()
        
    def forward(self, x):
        x = self.conv(x)
        print(x)
        x = self.group_norm(x)
        x = self.gelu(x)
        return x

The nn.Conv2d weights are fixed as follows.

import numpy as np

# Fix the random seed
np.random.seed(42)

# Convolution parameters
in_ch = 64
out_ch = 32
kernel_size = 3

# Weight shape shared by both frameworks: (out_channels, in_channels, kernel_h, kernel_w)
weight_np = np.random.randn(out_ch, in_ch, kernel_size, kernel_size).astype(np.float32)
print(weight_np)

The resulting weights are:

[[[[ 4.9671414e-01 -1.3826430e-01  6.4768857e-01]
   [ 1.5230298e+00 -2.3415338e-01 -2.3413695e-01]
   [ 1.5792128e+00  7.6743472e-01 -4.6947438e-01]]

  [[ 5.4256004e-01 -4.6341768e-01 -4.6572974e-01]
   [ 2.4196227e-01 -1.9132802e+00 -1.7249179e+00]
   [-5.6228751e-01 -1.0128311e+00  3.1424734e-01]]

  [[-9.0802407e-01 -1.4123037e+00  1.4656488e+00]
   [-2.2577630e-01  6.7528203e-02 -1.4247482e+00]
   [-5.4438275e-01  1.1092259e-01 -1.1509936e+00]]

  ...

Next, the input is fixed and passed through the convolution. The code and output are as follows:
# Fix the random seed for reproducibility

torch.manual_seed(42)

# Define the input parameters
batch_size = 1
in_ch = 64
out_ch = 32
group_size = 2  # Number of GroupNorm groups (must divide out_ch evenly)
height, width = 16, 16

# Generate a fixed random input of shape [batch_size, in_ch, height, width]
input_tensor = torch.randn(batch_size, in_ch, height, width)
print(input_tensor.dtype)

# Initialize the module
block = GELUConvBlock(in_ch, out_ch, group_size)

# Overwrite the conv weight with the fixed values
# (the bias is left at its default initialization)
with torch.no_grad():
    block.conv.weight.copy_(torch.tensor(weight_np))

# Forward pass
output = block(input_tensor)

# Print the output
# print(output)

torch.float32
tensor([[[[ 12.8685, -13.9928,  10.2607,  ..., -12.4960, -14.2980,  -5.3835],
          [ -6.0367,  -3.7743,   0.2609,  ..., -16.6816,  18.7610,  -1.6896],
          [ 20.4237, -29.2998, -13.6293,  ..., -45.5814,  22.7568, -24.5661],
          ...,
          [-41.9423, -33.8824,  -6.4840,  ..., -35.2593, -19.5696, -28.8629],
          [-15.1163, -18.7457,   9.1265,  ...,  -7.0993,  20.0634,  -1.6840],
          [ 20.7623,  21.6005,  -9.9522,  ...,   8.3804,   8.0551,  19.4785]],

         [[ 17.2460,  10.6954,  33.3876,  ..., -15.3125,  -2.3259,  12.5942],
          [-17.9752,   2.5080,  19.0125,  ...,  16.0485, -15.1048,   4.2263],
          [ 33.6510,  43.4881,   3.3938,  ..., -19.5759,   6.9983, -38.9134],
          ...,
          [  7.1470,  22.2052, -16.5933,  ...,   6.5175,  19.8026,  28.3119],
          [-16.4020,  12.0544, -53.4701,  ..., -15.9740,   4.4595, -32.3315],
          [  6.7833,   7.0182, -10.2338,  ..., -19.6627,  20.2132,  13.7222]],

         [[ -3.7571,  -7.8292,  -4.9203,  ...,  -3.0067,  27.1352, -14.4893],
          [-16.7018, -74.0261, -24.2703,  ...,  -6.4855,   5.9101,  -1.9450],
          [ -9.6489,  -2.2309,  15.9349,  ...,   3.4376,  37.7087,   0.7485],
          ...,
          [ 17.7427, -17.9123,  -1.9595,  ...,  21.7801,  43.1779,  -4.2342],
          [ 19.6051,   3.6274,   3.0769,  ...,  47.5359,  23.2772, -10.5134],
          [-10.2644,   0.4692,  24.5563,  ...,  39.0300,  10.2725, -18.6780]],
         ...]])
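
When two frameworks disagree, a framework-independent baseline can help decide which output is closer to the exact result. The following is a minimal sketch of a stride-1 convolution (cross-correlation, which is what both nn.Conv2d layers compute) in float64 NumPy. The function name conv2d_reference is hypothetical and not part of torch or MindSpore.

```python
import numpy as np

def conv2d_reference(x, w, bias=None, padding=1):
    """Stride-1 2-D cross-correlation computed in float64.

    x: (N, C, H, W) input, w: (O, C, kH, kW) weights, bias: optional (O,).
    Hypothetical helper for comparison only, not a library function.
    """
    x = np.asarray(x, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    n, c, h, wd = x.shape
    o, c_w, kh, kw = w.shape
    assert c == c_w, "input and weight channel counts must match"

    # Zero-pad the two spatial dimensions
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    out_h = h + 2 * padding - kh + 1
    out_w = wd + 2 * padding - kw + 1

    out = np.zeros((n, o, out_h, out_w), dtype=np.float64)
    # Accumulate the contribution of each kernel offset (i, j)
    for i in range(kh):
        for j in range(kw):
            patch = xp[:, :, i:i + out_h, j:j + out_w]  # (N, C, out_h, out_w)
            out += np.einsum("nchw,oc->nohw", patch, w[:, :, i, j])

    if bias is not None:
        out += np.asarray(bias, dtype=np.float64).reshape(1, -1, 1, 1)
    return out
```

Under these assumptions, conv2d_reference(input_tensor.numpy(), weight_np, bias=block.conv.bias.detach().numpy()) would give a float64 baseline for the torch output above, and omitting the bias argument would give the baseline for a bias-free layer.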

Next, the same block is ported to MindSpore and debugged on the NPU in PYNATIVE_MODE,
again with the initialization parameters and the input fixed:

import mindspore as ms
from mindspore import nn, Tensor

class GELUConvBlock(nn.Cell):
    def __init__(self, in_ch, out_ch, group_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch,
                              out_ch,
                              3,
                              stride=1,
                              has_bias=False,
                              padding=1,
                              pad_mode='pad',
                              dtype=ms.float32)
        self.conv.weight.set_data(Tensor(weight_np, dtype=ms.float32))  # Force the fixed initial weights
        self.group_norm = nn.GroupNorm(group_size, out_ch)
        self.gelu = nn.GELU(approximate=False)

    def construct(self, x):
        x = self.conv(x)
        print(x)
        x = self.group_norm(x)
        x = self.gelu(x)
        return x

The input is then passed through the convolution layer:

import torch
torch.manual_seed(42)
ms.set_seed(42)

# Define the input parameters
batch_size = 1
in_ch = 64
out_ch = 32
group_size = 2  # Number of GroupNorm groups (must divide out_ch evenly)
height, width = 16, 16

# Generate a fixed random input of shape [batch_size, in_ch, height, width]
input_tensor = torch.randn(batch_size, in_ch, height, width).detach().cpu().numpy()
input_tensor = Tensor(input_tensor, dtype=ms.float32)

# Initialize the module
block = GELUConvBlock(in_ch, out_ch, group_size)

# Forward pass
output = block(input_tensor)

# Print the output
# print(output)

[[[[ 12.88811 -13.965936 10.268403 ... -12.48695 -14.285012 -5.3615685 ]
   [ -6.010967 -3.7509742 0.27916503 ... -16.662071 18.78757 -1.6597592 ]
   [ 20.440784 -29.288181 -13.600904 ... -45.567055 22.780708 -24.546972 ]
   ...
   [-41.915035 -33.864902 -6.4803376 ... -35.24102 -19.55607 -28.843998 ]
   [-15.095564 -18.728704 9.1526985 ... -7.071264 20.083946 1.6947057 ]
   [ 20.78365 21.612846 -9.93457 ... 8.392191 8.073755 19.496553 ]]

  [[ 17.292835 10.740439 33.433903 ... -15.270564 -2.2864969 12.634817 ]
   [-17.921299 2.5554345 19.058323 ... 16.08888 -15.051955 4.2758093 ]
   [ 33.696594 43.529957 3.4286156 ... -19.52608 7.0427046 -38.873337 ]
   ...]]]

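The error is on the order of 10^-2. The model uses many more convolutions further on, so the error will grow as it accumulates. How can I match the precision of torch.nn.Conv2d?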
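
Before looking at numerical precision, it may be worth checking whether the gap looks like random noise or like a systematic offset. The sketch below compares only the values visible in the two printouts (the first row of channels 0 and 1) and is not a full-tensor comparison. In this excerpt every difference is positive, and for channel 1 the differences cluster tightly around a single value, which would be consistent with a per-channel bias being present on one side only. Note that torch.nn.Conv2d uses bias=True by default, while the MindSpore block is built with has_bias=False.

```python
import numpy as np

# First printed row of channels 0 and 1, copied from the two outputs above.
# Only the six visible values per row are used; elided values are not guessed.
torch_rows = {
    "channel 0": [12.8685, -13.9928, 10.2607, -12.4960, -14.2980, -5.3835],
    "channel 1": [17.2460, 10.6954, 33.3876, -15.3125, -2.3259, 12.5942],
}
ms_rows = {
    "channel 0": [12.88811, -13.965936, 10.268403, -12.48695, -14.285012, -5.3615685],
    "channel 1": [17.292835, 10.740439, 33.433903, -15.270564, -2.2864969, 12.634817],
}

diffs = {}
for name in torch_rows:
    # MindSpore value minus torch value for each visible position
    diff = np.array(ms_rows[name]) - np.array(torch_rows[name])
    diffs[name] = diff
    print(f"{name}: mean = {diff.mean():+.4f}, std = {diff.std():.4f}, "
          f"max |diff| = {np.abs(diff).max():.4f}")
```

If the bias does turn out to be the cause, one option is to build the torch layer with bias=False. Another is to enable has_bias=True in MindSpore and copy the torch bias over the same way the weight is copied. Whatever difference remains after that is the part attributable to numerical precision.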

3 Root cause analysis

----To be filled in by the user----
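
One candidate worth ruling out, stated here as an assumption rather than a confirmed diagnosis, is that the backend computes the convolution at reduced internal precision even when the tensors are declared float32. The NumPy sketch below does not run on Ascend and does not model its hardware. It only simulates a 576-term dot product (64 input channels times a 3x3 kernel) with the inputs rounded to float16, to show what order of magnitude such rounding alone would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 64 * 3 * 3       # terms per output value: in_ch * kernel_h * kernel_w
n_outputs = 4096     # number of simulated output values

x = rng.standard_normal((n_outputs, k)).astype(np.float32)
w = rng.standard_normal(k).astype(np.float32)

# float64 result used as the reference
ref = x.astype(np.float64) @ w.astype(np.float64)

# Plain float32 computation
out_fp32 = (x @ w).astype(np.float64)

# Inputs rounded to float16 first, products then accumulated in float32
x16 = x.astype(np.float16).astype(np.float32)
w16 = w.astype(np.float16).astype(np.float32)
out_fp16_inputs = (x16 @ w16).astype(np.float64)

err_fp32 = np.abs(out_fp32 - ref)
err_fp16 = np.abs(out_fp16_inputs - ref)
print(f"float32:        mean |err| = {err_fp32.mean():.2e}, max = {err_fp32.max():.2e}")
print(f"float16 inputs: mean |err| = {err_fp16.mean():.2e}, max = {err_fp16.max():.2e}")
```

If the float16-input error lands in the same range as the gap left over once the bias is aligned, the backend's precision settings for convolution would be the next thing to look at in the MindSpore documentation.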

4 Solution