Aligning the precision of torch.nn.Conv2d and ms.nn.Conv2d

jiaoguoning · July 19, 2025, 01:22

I ran into this problem while migrating a model from PyTorch to MindSpore.

This is the PyTorch code:

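import torch
from torch import nn


class GELUConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, group_size):
        super().__init__()
        # Positional args: kernel_size=3, stride=1, padding=1; bias defaults to True
        self.conv = nn.Conv2d(in_ch, out_ch, 3, 1, 1)
        self.group_norm = nn.GroupNorm(group_size, out_ch)
        self.gelu = nn.GELU()

    def forward(self, x):
        x = self.conv(x)
        print(x)  # debug: raw convolution output, compared against MindSpore
        x = self.group_norm(x)
        x = self.gelu(x)
        return x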

I fixed the weights of nn.Conv2d.

import numpy as np

# Fix the random seed
np.random.seed(42)

# Convolution parameters
in_ch = 64
out_ch = 32
kernel_size = 3

# Shared weight, shape: (out_channels, in_channels, kernel_h, kernel_w)
weight_np = np.random.randn(out_ch, in_ch, kernel_size, kernel_size).astype(np.float32)
print(weight_np)

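This prints the fixed weights (output not reproduced here).

Next I fixed the input and ran it through the convolution. The code is as follows (the printed output is not reproduced here):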

# Fix the random seed for reproducibility
torch.manual_seed(42)

# Input parameters
batch_size = 1
in_ch = 64
out_ch = 32
group_size = 2  # number of groups for GroupNorm (must divide out_ch evenly)
height, width = 16, 16

# Generate a fixed random input (shape [batch_size, in_ch, height, width])
input_tensor = torch.randn(batch_size, in_ch, height, width)
print(input_tensor.dtype)

# Initialize the module
block = GELUConvBlock(in_ch, out_ch, group_size)

# Only the weight is overwritten here; the bias keeps its random default init
with torch.no_grad():
    block.conv.weight.copy_(torch.tensor(weight_np))

# Forward pass
output = block(input_tensor)

# Print the output
# print(output)

Then I moved to MindSpore and debugged the code on an NPU in PYNATIVE_MODE,
again with the initial parameters and the input fixed.

import mindspore as ms
from mindspore import nn, Tensor


class GELUConvBlock(nn.Cell):
    def __init__(self, in_ch, out_ch, group_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch,
                              out_ch,
                              3,
                              stride=1,
                              has_bias=False,
                              padding=1,
                              pad_mode='pad',
                              dtype=ms.float32)
        self.conv.weight.set_data(Tensor(weight_np, dtype=ms.float32))  # force the fixed weights
        self.group_norm = nn.GroupNorm(group_size, out_ch)
        self.gelu = nn.GELU(approximate=False)

    def construct(self, x):
        x = self.conv(x)
        print(x)  # debug: raw convolution output, compared against PyTorch
        x = self.group_norm(x)
        x = self.gelu(x)
        return x
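
A side note on the approximate=False setting above: as far as I know, mindspore.nn.GELU defaults to the tanh approximation, while torch.nn.GELU defaults to the exact erf form, so pinning it is reasonable. The following is a minimal standard-library sketch of how far apart the two formulas are. The function names are illustrative and not part of either framework.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scan [-5, 5] in steps of 0.01 and record the largest gap between the two forms.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-5, 5]: {max_gap:.2e}")
```

The gap is smaller than the 10^-2 error reported in this post, so GELU is unlikely to be the main cause, but it is worth pinning when chasing smaller residuals.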

Running the input through the convolution layer:

import torch
torch.manual_seed(42)
ms.set_seed(42)

# Input parameters
batch_size = 1
in_ch = 64
out_ch = 32
group_size = 2  # number of groups for GroupNorm (must divide out_ch evenly)
height, width = 16, 16

# Generate a fixed random input (shape [batch_size, in_ch, height, width])
input_tensor = torch.randn(batch_size, in_ch, height, width).detach().cpu().numpy()
input_tensor = Tensor(input_tensor, dtype=ms.float32)

# Initialize the module
block = GELUConvBlock(in_ch, out_ch, group_size)

# Forward pass
output = block(input_tensor)

# Print the output
# print(output)

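This gave the following result (output not reproduced here).

The error is on the order of 10^-2. The rest of the model uses many more convolutions, so the error will grow as it accumulates.

How can I match the precision of torch.nn.Conv2d?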

For reference, the versions are:

mindspore 2.2.14
torch 2.7.1
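
Rather than comparing printed tensors by eye, the mismatch can be measured directly. The sketch below assumes both outputs have already been converted to NumPy arrays; compare_outputs and the stand-in arrays are hypothetical names for illustration only.

```python
import numpy as np


def compare_outputs(ref: np.ndarray, test: np.ndarray) -> dict:
    """Summarize the mismatch between a reference output and a test output."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    if ref.shape != test.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {test.shape}")
    abs_err = np.abs(ref - test)
    # Relative error is measured against the reference magnitude, guarded against /0.
    rel_err = abs_err / np.maximum(np.abs(ref), 1e-12)
    return {
        "max_abs": float(abs_err.max()),
        "mean_abs": float(abs_err.mean()),
        "max_rel": float(rel_err.max()),
    }


# Hypothetical stand-ins for the two framework outputs, converted to NumPy.
rng = np.random.default_rng(0)
torch_out = rng.standard_normal((1, 32, 16, 16)).astype(np.float32)
ms_out = torch_out + np.float32(0.01)  # simulate a constant 1e-2 offset

report = compare_outputs(torch_out, ms_out)
print(report)
```

In practice the arrays would come from output.detach().numpy() on the PyTorch side and output.asnumpy() on the MindSpore side. A max_abs near float32 rounding level (around 1e-6 for values of this size) suggests the layers agree, whereas 1e-2 points to a configuration difference rather than rounding.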

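zhouyifengCode · July 19, 2025, 01:29

The two have different default parameters, as noted in the documentation.

If you are on the latest MindSpore, you can try ms.mint.nn.Conv2d. According to the documentation, it is consistent with torch.nn.Conv2d.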

zhouyifengCode · July 19, 2025, 01:43

GroupNorm differs as well.

If you are on the latest version, you can also use mindspore.mint.nn.GroupNorm.
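
To check whether GroupNorm contributes to a mismatch on either side, a framework-independent reference can help. This is a minimal NumPy sketch, assuming NCHW input, eps=1e-5, and identity affine parameters (gamma=1, beta=0), which I believe are the defaults in both frameworks. group_norm_ref is a hypothetical name.

```python
import numpy as np


def group_norm_ref(x: np.ndarray, num_groups: int, eps: float = 1e-5) -> np.ndarray:
    """Reference GroupNorm for NCHW input with gamma=1 and beta=0."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("num_groups must divide the channel count")
    # Compute in float64 for a tight reference, one row per (sample, group).
    g = x.astype(np.float64).reshape(n, num_groups, -1)
    mean = g.mean(axis=2, keepdims=True)
    var = g.var(axis=2, keepdims=True)  # biased (population) variance
    out = (g - mean) / np.sqrt(var + eps)
    return out.reshape(n, c, h, w).astype(x.dtype)


# Hypothetical input with the shape the conv layer in this thread produces.
rng = np.random.default_rng(0)
conv_out = rng.standard_normal((1, 32, 16, 16)).astype(np.float32)
normed = group_norm_ref(conv_out, num_groups=2)
print(normed.shape, float(normed.mean()), float(normed.std()))
```

Feeding the same convolution output to this function and to each framework's GroupNorm shows whether the normalization layer or the convolution is responsible for a difference.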

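jiaoguoning · July 19, 2025, 01:57

Thanks. So that requires the latest version. I have only just finished aligning the versions of the driver, firmware, CANN, and MindSpore, and I am worn out.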

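zhouyifengCode · July 19, 2025, 01:59

As far as I remember, the operators that differ can still be aligned. Only their default parameters differ, so you need to set those parameters manually to match torch. I did this a few years ago when migrating models, so there should be a way to align them.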
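
The manual alignment described above can be sketched as a small translation helper. It assumes the defaults I am aware of: torch.nn.Conv2d uses padding=0 and bias=True, while mindspore.nn.Conv2d uses pad_mode='same' and has_bias=False. torch_conv2d_to_ms_kwargs is a hypothetical helper, not part of either library, and it only handles integer or (height, width) padding.

```python
def torch_conv2d_to_ms_kwargs(in_channels, out_channels, kernel_size,
                              stride=1, padding=0, dilation=1, groups=1, bias=True):
    """Translate torch.nn.Conv2d-style arguments (with torch defaults) into
    keyword arguments for mindspore.nn.Conv2d. Hypothetical helper."""
    if isinstance(padding, str):
        raise NotImplementedError("string padding modes are not handled in this sketch")
    if isinstance(padding, int):
        pad = padding
    else:
        pad_h, pad_w = padding
        # MindSpore expects (top, bottom, left, right) for a 4-tuple.
        pad = (pad_h, pad_h, pad_w, pad_w)
    no_pad = pad == 0 or pad == (0, 0, 0, 0)
    return {
        "in_channels": in_channels,
        "out_channels": out_channels,
        "kernel_size": kernel_size,
        "stride": stride,
        # MindSpore defaults to pad_mode='same'; explicit padding needs pad_mode='pad'.
        "pad_mode": "valid" if no_pad else "pad",
        "padding": 0 if no_pad else pad,
        "dilation": dilation,
        "group": groups,    # torch calls this `groups`
        "has_bias": bias,   # torch defaults to True, MindSpore to False
    }


# Same positional arguments as nn.Conv2d(in_ch, out_ch, 3, 1, 1) in the torch code.
kwargs = torch_conv2d_to_ms_kwargs(64, 32, 3, 1, 1)
print(kwargs)
```

The resulting dictionary could then be passed as nn.Conv2d(**kwargs) on the MindSpore side. Note that it yields has_bias=True for the torch call in this thread, which differs from the has_bias=False used in the original MindSpore block.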

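jiaoguoning · July 19, 2025, 02:11

It matches now. My bias setting differed from torch.nn.Conv2d: torch enables the bias by default (bias=True), whereas I had removed it. After fixing the bias to the same values on both sides, the outputs of the two sides agree. Thank you.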

class GELUConvBlock(nn.Cell):
    def __init__(self, in_ch, out_ch, group_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch,
                              out_ch,
                              3,
                              stride=1,
                              has_bias=True,
                              padding=1,
                              pad_mode='pad',
                              dtype=ms.float32)
        # bias_np: fixed bias of shape (out_ch,), the same values as on the torch side
        self.conv.weight.set_data(Tensor(weight_np, dtype=ms.float32))  # force identical weights
        self.conv.bias.set_data(Tensor(bias_np, dtype=ms.float32))  # force identical bias
        self.group_norm = nn.GroupNorm(group_size, out_ch)
        self.gelu = nn.GELU(approximate=False)
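
A rough check of why the missing bias showed up at about 10^-2: assuming PyTorch's default Conv2d initialization, which to my knowledge draws the bias uniformly from [-1/sqrt(fan_in), 1/sqrt(fan_in)] with fan_in = in_ch * kernel_h * kernel_w, the size of the torch-side bias in this thread can be estimated with plain arithmetic.

```python
import math

in_ch, kernel_size = 64, 3

# fan_in of the conv weight: input channels times kernel area.
fan_in = in_ch * kernel_size * kernel_size
# Assumed bound of PyTorch's default uniform bias init for Conv2d.
bias_bound = 1.0 / math.sqrt(fan_in)

print(f"fan_in = {fan_in}, |bias| <= {bias_bound:.4f}")
# prints: fan_in = 576, |bias| <= 0.0417
```

A per-channel offset of up to roughly 0.04 is consistent with the reported 10^-2 error. On the PyTorch side, the same bias_np (whose definition is not shown in the thread) can be copied in the same way as the weight, for example block.conv.bias.copy_(torch.tensor(bias_np)) inside the torch.no_grad() block.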
