Conv2d execution is slow

Compared with torch, Conv2d takes noticeably longer to execute.

Environment

  1. GPU: NVIDIA 5880 Ada
  2. MindSpore version: 2.6.0.dev20250323
  3. torch version: 2.5.1+cu124

Problem description

With the same input and the same network, MindSpore's execution time is slower. I am not sure whether the same happens on Ascend devices.

The reproduction code is below.

def compare_conv2d():
    import torch
    import torch.nn as tnn
    import mindspore as ms
    import mindspore.nn as mnn
    import numpy as np
    import time
    
    gpu_id = 3
    ms.set_context(device_id=gpu_id, mode=ms.PYNATIVE_MODE, device_target="GPU")
    device = torch.device(f'cuda:{gpu_id}' if torch.cuda.is_available() else 'cpu')
    
    # Parameter settings
    batch_size, h, w = 16, 256, 256
    in_channels, out_channels, kernel_size = 3, 8, 3
    epochs = 500000

    # Input data, all values set to 1.5
    torch_input = (torch.ones((batch_size, in_channels, h, w)) * 1.5).to(device)
    ms_input = ms.Tensor(np.ones((batch_size, in_channels, h, w), dtype=np.float32) * 1.5)

    # Define the networks
    torch_conv = tnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=1, bias=True).to(device)
    ms_conv = mnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode='pad', padding=1, has_bias=True)

    # Synchronize weights and biases
    with torch.no_grad():
        torch_conv.weight.fill_(0.5)
        torch_conv.bias.fill_(0.1)
    ms_conv.weight.set_data(ms.Tensor(np.full(ms_conv.weight.shape, 0.5, dtype=np.float32)))
    ms_conv.bias.set_data(ms.Tensor(np.full(ms_conv.bias.shape, 0.1, dtype=np.float32)))

    # Run the forward pass repeatedly in a loop
    start_time = time.time()
    for i in range(epochs):
        torch_out = torch_conv(torch_input)
    print(f"Torch 共 {epochs} 次推理时间: {time.time() - start_time}")

    start_time = time.time()
    for i in range(epochs):
        ms_out = ms_conv(ms_input)
    print(f"MindSpore 共 {epochs} 次推理时间: {time.time() - start_time}")


The results:

Torch total time for 500000 inference calls: 87.73985862731934
MindSpore total time for 500000 inference calls: 99.30602216720581

With a smaller epochs value, the execution-time gap between ms and torch grows even larger.
It seems that ms does some pre-compilation? With epochs = 1 the gap is at its largest and ms's per-call time is at its maximum; as epochs grows, ms's per-call time drops.
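A rough way to separate that first-call overhead (compilation, warm-up) from the steady-state per-call cost is sketched below. It reuses torch_conv / torch_input / ms_conv / ms_input from the reproduction code above, and uses torch.cuda.synchronize() and Tensor.asnumpy() only as crude synchronization points so that the asynchronous GPU work has actually finished before the clock is read.

import time
import torch

def bench(run_once, sync, warmup=10, iters=1000):
    # Warm-up calls absorb one-off costs (compilation, kernel selection,
    # cuDNN autotuning) so they are not mixed into the steady-state timing.
    out = None
    for _ in range(warmup):
        out = run_once()
    sync(out)
    start = time.time()
    for _ in range(iters):
        out = run_once()
    sync(out)  # wait for pending GPU work before stopping the timer
    return (time.time() - start) / iters

# torch: wait on the CUDA stream before reading the clock
per_call_torch = bench(lambda: torch_conv(torch_input),
                       lambda out: torch.cuda.synchronize())
# MindSpore: copying the result back to host with .asnumpy() forces completion
per_call_ms = bench(lambda: ms_conv(ms_input),
                    lambda out: out.asnumpy())
print(per_call_torch, per_call_ms)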

Request

Is this a GPU-specific issue? Is there a way to configure it so there is no pre-compilation, as with torch? Or, failing that, some way to speed up execution? In my project the number of epochs is small, so the time cost is significant.

Hello, welcome to MindSpore. We have received your question above; please be patient while we get back to you~

Try mode=ms.GRAPH_MODE.

Setting ms.GRAPH_MODE makes it even slower.

Below is the code with ms.GRAPH_MODE set, together with its output.
The only change from the code above is:
ms.set_context(device_id=gpu_id, mode=ms.GRAPH_MODE, device_target="GPU")

Code:

def compare_conv2d():
    import torch
    import torch.nn as tnn
    import mindspore as ms
    import mindspore.nn as mnn
    import numpy as np
    import time
    import paddle
    import paddle.nn as pnn

    gpu_id = 3
    # ms.set_context(device_id=gpu_id, mode=ms.PYNATIVE_MODE, device_target="GPU")
    ms.set_context(device_id=gpu_id, mode=ms.GRAPH_MODE, device_target="GPU")
    device = torch.device(f'cuda:{gpu_id}' if torch.cuda.is_available() else 'cpu')
    paddle.set_device(f'gpu:{gpu_id}' if paddle.is_compiled_with_cuda() else 'cpu')

    # Parameter settings
    batch_size, h, w = 16, 256, 256
    in_channels, out_channels, kernel_size = 3, 8, 3
    epochs = 500000

    # Input data, all values set to 1.5
    torch_input = (torch.ones((batch_size, in_channels, h, w)) * 1.5).to(device)
    ms_input = ms.Tensor(np.ones((batch_size, in_channels, h, w), dtype=np.float32) * 1.5)
    paddle_input = paddle.to_tensor(np.ones((batch_size, in_channels, h, w), dtype=np.float32) * 1.5)

    # Define the networks
    torch_conv = tnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=1, bias=True).to(device)
    ms_conv = mnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode='pad', padding=1, has_bias=True)
    paddle_conv = pnn.Conv2D(in_channels, out_channels, kernel_size, stride=1, padding=1)

    # Synchronize weights and biases
    with torch.no_grad():
        torch_conv.weight.fill_(0.5)
        torch_conv.bias.fill_(0.1)
    ms_conv.weight.set_data(ms.Tensor(np.full(ms_conv.weight.shape, 0.5, dtype=np.float32)))
    ms_conv.bias.set_data(ms.Tensor(np.full(ms_conv.bias.shape, 0.1, dtype=np.float32)))
    paddle_conv.weight.set_value(paddle.to_tensor(np.full(paddle_conv.weight.shape, 0.5, dtype=np.float32)))
    paddle_conv.bias.set_value(paddle.to_tensor(np.full(paddle_conv.bias.shape, 0.1, dtype=np.float32)))

    # Run the forward pass repeatedly in a loop
    start_time = time.time()
    for i in range(epochs):
        torch_out = torch_conv(torch_input)
    print(f"Torch 共 {epochs} 次推理时间: {time.time() - start_time}")

    start_time = time.time()
    for i in range(epochs):
        ms_out = ms_conv(ms_input)
    print(f"MindSpore 共 {epochs} 次推理时间: {time.time() - start_time}")

    start_time = time.time()
    for i in range(epochs):
        paddle_out = paddle_conv(paddle_input)
    print(f"Paddle 共 {epochs} 次推理时间: {time.time() - start_time}")

Output:


Torch total time for 500000 inference calls: 88.99779653549194
MindSpore total time for 500000 inference calls: 1835.388904094696
Paddle total time for 500000 inference calls: 91.04123115539551

Don't measure performance this way. You should build a network and run the loop inside the network, rather than calling the nn interface in a loop from outside; the two behave differently.

Roughly along these lines:

import mindspore.nn as nn

class Network(nn.Cell):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode='pad', padding=1, has_bias=True)

    def construct(self, x, epochs):
        for i in range(epochs):
            x = self.conv2d(x)
        return x
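For reference, a minimal, self-contained way the sketch above could be exercised (purely illustrative, not from the original thread; equal in/out channel counts are used so that feeding the conv output back into the same conv stays shape-compatible):

import numpy as np
import mindspore as ms

# Hypothetical driver for the Network sketch above
net = Network(in_channels=3, out_channels=3, kernel_size=3)
x = ms.Tensor(np.ones((16, 3, 256, 256), dtype=np.float32) * 1.5)
out = net(x, 100)  # the whole loop now runs inside a single Cell call
print(out.shape)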

Hello, the MindSpore support staff have analyzed the problem and provided its cause. Since the answer has not been accepted for quite a while, the moderator will mark it as accepted and close the thread. If you have further questions, please open a new post. Thank you for your support~
