MindSpore custom operator: incorrect gradient computation and abnormal loss

chengxiaoli · October 27, 2025, 06:17

1 System environment

Hardware (Ascend/GPU/CPU): Ascend
MindSpore version: MindSpore=2.6.0
Execution mode (PyNative/Graph): any
Python version: Python=3.10
Operating system: Linux

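2 Error information

2.1 Problem description

I implemented a custom operator in MindSpore. The forward computation works, but during training the gradients turn out to be incorrect, so the model fails to converge.
The forward results match a NumPy implementation, but the backpropagated gradients differ substantially from the numerical gradients, and during training the loss either does not decrease or becomes NaN.

2.2 Script information

Custom operator code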

import mindspore as ms
from mindspore import nn, ops
from mindspore.ops import custom_op_utils

class CustomOp(nn.Cell):
    def __init__(self):
        super().__init__()

    def construct(self, x):
        # Forward computation: y = x * sin(x)
        return x * ops.sin(x)

    def bprop(self, x, out, dout):
        # Backward pass: dout * d/dx [x * sin(x)]
        return (dout * (ops.sin(x) + x * ops.cos(x)), )


# Register the custom operator
custom_op_utils.reg_op('custom_op', CustomOp)
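
For reference, the derivative that bprop is meant to encode follows from the product rule. The block below is only a worked derivation; the symbols y (the forward output) and \bar{y} (standing for dout) are notation introduced here, not names from the original script.

```latex
\begin{aligned}
y &= x \sin x \\
\frac{\mathrm{d}y}{\mathrm{d}x} &= \sin x + x \cos x \\
\bar{x} &= \bar{y} \cdot \left( \sin x + x \cos x \right)
\end{aligned}
```

The last line has the same form as the expression returned by bprop above, so the formula written in the script agrees with the analytic derivative of x * sin(x).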

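Gradient verification code

import numpy as np

# Numerical gradient via central differences
def numerical_gradient(f, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus.flat[i] += eps
        x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Compare against the custom operator's gradient
x = ms.Tensor(np.random.randn(10), dtype=ms.float32)
custom_op = CustomOp()

# Gradient from the custom operator's bprop
grad_auto = ms.grad(custom_op)(x)

# Numerical gradient of the summed output
def f(x):
    # x arrives as a NumPy array: wrap it in a Tensor and return a Python float
    return float(custom_op(ms.Tensor(x)).sum().asnumpy())

grad_numerical = numerical_gradient(f, x.asnumpy())

print(f"Automatic gradient: {grad_auto}")
print(f"Numerical gradient: {grad_numerical}")
print(f"Max abs difference: {np.abs(grad_auto.asnumpy() - grad_numerical).max()}")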
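
When comparing against a finite-difference gradient, it can help to first confirm that the check itself is trustworthy at the chosen precision and step size. The following is a minimal standard-library sketch, not the poster's script and not a diagnosis of the reported issue: it compares the analytic derivative sin(x) + x*cos(x) with a central difference in float64 and in an emulated float32. The helper names (to_f32, central_diff_f32, max_errors) and the sample points are hypothetical, and the float32 behaviour is only approximated by rounding inputs and outputs with struct.

```python
import math
import struct


def to_f32(v):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def f(x):
    """Forward function of the operator: x * sin(x), in float64."""
    return x * math.sin(x)


def analytic_grad(x):
    """Derivative used in bprop: sin(x) + x * cos(x)."""
    return math.sin(x) + x * math.cos(x)


def central_diff_f64(x, eps):
    """Central difference evaluated entirely in float64."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def central_diff_f32(x, eps):
    """Central difference with inputs and outputs rounded to float32.

    This only approximates float32 arithmetic: sin is still evaluated in
    float64 and the result is rounded afterwards.
    """
    x_plus = to_f32(to_f32(x) + eps)
    x_minus = to_f32(to_f32(x) - eps)
    y_plus = to_f32(f(x_plus))
    y_minus = to_f32(f(x_minus))
    return (y_plus - y_minus) / (2 * eps)


def max_errors(xs, eps):
    """Return the largest |numerical - analytic| error for float64 and float32."""
    err64 = max(abs(central_diff_f64(x, eps) - analytic_grad(x)) for x in xs)
    err32 = max(abs(central_diff_f32(x, eps) - analytic_grad(x)) for x in xs)
    return err64, err32


if __name__ == "__main__":
    points = [0.5, 1.0, 2.0, -1.5]  # arbitrary sample inputs
    for eps in (1e-6, 1e-2):
        err64, err32 = max_errors(points, eps)
        print(f"eps={eps:g}  float64 error={err64:.3e}  float32 error={err32:.3e}")
```

In this emulation the float32 error at eps=1e-6 is orders of magnitude larger than the float64 error, while a larger step such as 1e-2 brings it back down. Whether this accounts for the discrepancy reported above is not established here; it only shows that the step size and dtype of the reference gradient are worth checking alongside bprop.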

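3 Root cause analysis

To be filled in by the user

4 Solution

To be filled in by the user
Include a written explanation and the final script. Please package the corrected script and upload it as an attachment.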

longvoyage · October 27, 2025, 13:03

Write-up: Forum Error-Report Campaign, Issue 38 - MindSpore custom operator: incorrect gradient computation and abnormal loss
