Forum Error-Report Campaign No. 38: MindSpore custom operator computes incorrect gradients, causing abnormal loss

longvoyage · October 27, 2025, 13:02

1 System environment

Hardware (Ascend/GPU/CPU): Ascend
MindSpore version: MindSpore=2.6.0
Execution mode (PyNative/Graph): any
Python version: Python=3.10
Operating system: Linux

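2 Error information

2.1 Problem description

I implemented a custom operator in MindSpore. The forward computation works, but during training the gradients turn out to be incorrect and the model fails to converge.
The forward output matches a NumPy implementation, while the backward gradients differ greatly from the numerical gradients, and during training the loss either does not decrease or becomes NaN.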

2.2 Script

Custom operator code

import mindspore as ms
from mindspore import nn, ops
from mindspore.ops import custom_op_utils

class CustomOp(nn.Cell):
    def __init__(self):
        super().__init__()
   
    def construct(self, x):
        # Forward computation
        return x * ops.sin(x)

    def bprop(self, x, out, dout):
        # Backward propagation
        return (dout * (ops.sin(x) + x * ops.cos(x)), )


# Register the custom operator
custom_op_utils.reg_op('custom_op', CustomOp)

Gradient check code

# Numerical gradient computation
def numerical_gradient(f, x, eps=1e-6):
    grad = ops.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus.flat[i] += eps
        x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Compare against the custom operator's gradient
x = ms.Tensor(np.random.randn(10), dtype=ms.float32)
custom_op = CustomOp()

# Gradient of the custom operator
grad_auto = ms.grad(custom_op)(x)

# Numerical gradient
def f(x):
    return custom_op(x).sum()

grad_numerical = numerical_gradient(f, x.asnumpy())

print(f"Automatic gradient: {grad_auto}")
print(f"Numerical gradient: {grad_numerical}")
print(f"Difference: {np.abs(grad_auto.asnumpy() - grad_numerical).max()}")

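3 Root cause analysis

MindSpore can differentiate construct automatically, so why write a bprop at all? Leaving the gradient to automatic differentiation is enough here.

In addition, the numerical_gradient function in the check script does not actually implement the derivative of the original function.

Finally, custom_op_utils, which the original code imports from mindspore.ops, cannot be found anywhere in the MindSpore API.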
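For comparison, a finite-difference check that stays entirely in NumPy can be sketched as below. This is a minimal sketch and not the fix used in this thread: the names analytic_gradient, x64, x32, err64 and err32 are introduced here for illustration, and the float32 run is included on the assumption that the step eps=1e-6 from the original script is too small for single precision.

```python
import numpy as np


def f(x):
    """Scalar objective: sum of x * sin(x), mirroring custom_op(x).sum()."""
    return np.sum(x * np.sin(x))


def analytic_gradient(x):
    """Closed-form derivative of x * sin(x), applied element-wise."""
    return np.sin(x) + x * np.cos(x)


def numerical_gradient(f, x, eps):
    """Central-difference estimate of df/dx, one element at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
x64 = rng.standard_normal(10)   # float64 input
x32 = x64.astype(np.float32)    # same values stored in float32

# Maximum absolute deviation from the closed-form derivative
err64 = np.abs(numerical_gradient(f, x64, 1e-6) - analytic_gradient(x64)).max()
err32 = np.abs(numerical_gradient(f, x32, 1e-6) - analytic_gradient(x32)).max()

print(f"float64, eps=1e-6: max error = {err64:.3e}")
print(f"float32, eps=1e-6: max error = {err32:.3e}")
```

If the float32 error comes out much larger than the float64 one, the step size, not the operator's gradient, is the likely source of the mismatch; a larger step (for example 1e-3) or a float64 input is the usual remedy for a single-precision check.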

4 Solution

Remove the bprop implementation and rewrite numerical_gradient.

import mindspore as ms
from mindspore import nn, ops
import numpy as np
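# custom_op_utils could not be found in the MindSpore API, so this import is disabled
#from mindspore.ops import custom_op_utils

class CustomOp(nn.Cell):
    def __init__(self):
        super().__init__()

    def construct(self, x):
        # Forward computation
        return x * ops.sin(x)

    # bprop removed: the gradient is left to automatic differentiation
    #def bprop(self, x, out, dout):
    #    # Backward propagation
    #    return (dout * (ops.sin(x) + x * ops.cos(x)), )


# Register the custom operator (disabled together with the import above)
#custom_op_utils.reg_op('custom_op', CustomOp)

# Gradient check code
# Reference gradient: closed-form derivative of x * sin(x)
def numerical_gradient(f, x):
    # f is unused; it is kept so the call site stays the same
    return ops.sin(x) + x * ops.cos(x)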



# Compare against the custom operator's gradient
x = ms.Tensor(np.random.randn(10), dtype=ms.float32)
custom_op = CustomOp()

# Gradient of the custom operator (automatic differentiation)
grad_auto = ms.grad(custom_op)(x)

# Reference gradient
def f(x):
    return custom_op(x).sum()

grad_numerical = numerical_gradient(f, x)

print(f"Automatic gradient: {grad_auto}")
print(f"Numerical gradient: {grad_numerical}")
# Convert both tensors to NumPy before taking the difference
print(f"Difference: {np.abs(grad_auto.asnumpy() - grad_numerical.asnumpy()).max()}")

The output is as follows:

Automatic gradient: [ 0.10518813 -1.3343601   1.3910065   1.3183469  -0.6285746  -1.2344753
 -0.67456186  0.14373049 -1.0919948   0.2005248 ]
Numerical gradient: [ 0.10518813 -1.3343601   1.3910065   1.3183469  -0.6285746  -1.2344753
 -0.67456186  0.14373049 -1.0919948   0.2005248 ]
Difference: 0.0
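
The reference gradient in the solution script is the closed-form derivative rather than a finite-difference estimate, so agreement with automatic differentiation is expected. The expression follows from the product rule, applied element-wise because the check sums over the elements of x:

```latex
\frac{d}{dx}\bigl(x \sin x\bigr) = \sin x + x \cos x,
\qquad
\frac{\partial}{\partial x_i} \sum_j x_j \sin x_j = \sin x_i + x_i \cos x_i .
```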
