Conv2d execution is slow

Compared with torch, Conv2d takes noticeably longer to execute.

Environment

  1. GPU: NVIDIA 5880 Ada
  2. MindSpore version: 2.6.0.dev20250323
  3. torch version: 2.5.1+cu124

Problem description

With the same input and the same network, MindSpore's execution time is slower. I am not sure whether the same happens on Ascend devices.

The reproduction code is below.

def compare_conv2d():
    import torch
    import torch.nn as tnn
    import mindspore as ms
    import mindspore.nn as mnn
    import numpy as np
    import time
    
    gpu_id = 3
    ms.set_context(device_id=gpu_id, mode=ms.PYNATIVE_MODE, device_target="GPU")
    device = torch.device(f'cuda:{gpu_id}' if torch.cuda.is_available() else 'cpu')
    
    # Parameter settings
    batch_size, h, w = 16, 256, 256
    in_channels, out_channels, kernel_size = 3, 8, 3
    epochs = 500000

    # Input data, all values set to 1.5
    torch_input = (torch.ones((batch_size, in_channels, h, w)) * 1.5).to(device)
    ms_input = ms.Tensor(np.ones((batch_size, in_channels, h, w), dtype=np.float32) * 1.5)

    # Define the networks
    torch_conv = tnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=1, bias=True).to(device)
    ms_conv = mnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode='pad', padding=1, has_bias=True)

    # Synchronize weights and biases
    with torch.no_grad():
        torch_conv.weight.fill_(0.5)
        torch_conv.bias.fill_(0.1)
    ms_conv.weight.set_data(ms.Tensor(np.full(ms_conv.weight.shape, 0.5, dtype=np.float32)))
    ms_conv.bias.set_data(ms.Tensor(np.full(ms_conv.bias.shape, 0.1, dtype=np.float32)))

    # Run the forward pass repeatedly in a loop
    start_time = time.time()
    for i in range(epochs):
        torch_out = torch_conv(torch_input)
    print(f"Torch 共 {epochs} 次推理时间: {time.time() - start_time}")

    start_time = time.time()
    for i in range(epochs):
        ms_out = ms_conv(ms_input)
    print(f"MindSpore 共 {epochs} 次推理时间: {time.time() - start_time}")


The results:

Torch total time for 500000 inference calls: 87.73985862731934
MindSpore total time for 500000 inference calls: 99.30602216720581

With a smaller epochs value, the execution-time gap between ms and torch grows even larger.
It seems that ms does some pre-compilation? With epochs = 1 the gap is at its largest and ms's per-call time is at its maximum; as epochs grows, ms's per-call time drops.
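A rough way to separate that first-call overhead (compilation, warm-up) from the steady-state per-call cost is sketched below. It reuses torch_conv / torch_input / ms_conv / ms_input from the reproduction code above, and uses torch.cuda.synchronize() and Tensor.asnumpy() only as crude synchronization points so that the asynchronous GPU work has actually finished before the clock is read.

import time
import torch

def bench(run_once, sync, warmup=10, iters=1000):
    # Warm-up calls absorb one-off costs (compilation, kernel selection,
    # cuDNN autotuning) so they are not mixed into the steady-state timing.
    out = None
    for _ in range(warmup):
        out = run_once()
    sync(out)
    start = time.time()
    for _ in range(iters):
        out = run_once()
    sync(out)  # wait for pending GPU work before stopping the timer
    return (time.time() - start) / iters

# torch: wait on the CUDA stream before reading the clock
per_call_torch = bench(lambda: torch_conv(torch_input),
                       lambda out: torch.cuda.synchronize())
# MindSpore: copying the result back to host with .asnumpy() forces completion
per_call_ms = bench(lambda: ms_conv(ms_input),
                    lambda out: out.asnumpy())
print(per_call_torch, per_call_ms)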

Request

Is this a GPU-specific issue? Is there a way to configure it so there is no pre-compilation, as with torch? Or, failing that, some way to speed up execution? In my project the number of epochs is small, so the time cost is significant.

Hello, welcome to MindSpore. We have received your question above; please be patient while we get back to you~

Try mode=ms.GRAPH_MODE.

Setting ms.GRAPH_MODE makes it even slower.

Below is the code with ms.GRAPH_MODE set, together with its output.
The only change from the code above is:
ms.set_context(device_id=gpu_id, mode=ms.GRAPH_MODE, device_target="GPU")

Code:

def compare_conv2d():
    import torch
    import torch.nn as tnn
    import mindspore as ms
    import mindspore.nn as mnn
    import numpy as np
    import time
    import paddle
    import paddle.nn as pnn

    gpu_id = 3
    # ms.set_context(device_id=gpu_id, mode=ms.PYNATIVE_MODE, device_target="GPU")
    ms.set_context(device_id=gpu_id, mode=ms.GRAPH_MODE, device_target="GPU")
    device = torch.device(f'cuda:{gpu_id}' if torch.cuda.is_available() else 'cpu')
    paddle.set_device(f'gpu:{gpu_id}' if paddle.is_compiled_with_cuda() else 'cpu')

    # Parameter settings
    batch_size, h, w = 16, 256, 256
    in_channels, out_channels, kernel_size = 3, 8, 3
    epochs = 500000

    # Input data, all values set to 1.5
    torch_input = (torch.ones((batch_size, in_channels, h, w)) * 1.5).to(device)
    ms_input = ms.Tensor(np.ones((batch_size, in_channels, h, w), dtype=np.float32) * 1.5)
    paddle_input = paddle.to_tensor(np.ones((batch_size, in_channels, h, w), dtype=np.float32) * 1.5)

    # Define the networks
    torch_conv = tnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=1, bias=True).to(device)
    ms_conv = mnn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode='pad', padding=1, has_bias=True)
    paddle_conv = pnn.Conv2D(in_channels, out_channels, kernel_size, stride=1, padding=1)

    # Synchronize weights and biases
    with torch.no_grad():
        torch_conv.weight.fill_(0.5)
        torch_conv.bias.fill_(0.1)
    ms_conv.weight.set_data(ms.Tensor(np.full(ms_conv.weight.shape, 0.5, dtype=np.float32)))
    ms_conv.bias.set_data(ms.Tensor(np.full(ms_conv.bias.shape, 0.1, dtype=np.float32)))
    paddle_conv.weight.set_value(paddle.to_tensor(np.full(paddle_conv.weight.shape, 0.5, dtype=np.float32)))
    paddle_conv.bias.set_value(paddle.to_tensor(np.full(paddle_conv.bias.shape, 0.1, dtype=np.float32)))

    # Run the forward pass repeatedly in a loop
    start_time = time.time()
    for i in range(epochs):
        torch_out = torch_conv(torch_input)
    print(f"Torch 共 {epochs} 次推理时间: {time.time() - start_time}")

    start_time = time.time()
    for i in range(epochs):
        ms_out = ms_conv(ms_input)
    print(f"MindSpore 共 {epochs} 次推理时间: {time.time() - start_time}")

    start_time = time.time()
    for i in range(epochs):
        paddle_out = paddle_conv(paddle_input)
    print(f"Paddle 共 {epochs} 次推理时间: {time.time() - start_time}")

Output:


Torch total time for 500000 inference calls: 88.99779653549194
MindSpore total time for 500000 inference calls: 1835.388904094696
Paddle total time for 500000 inference calls: 91.04123115539551

Don't measure performance this way. You should build a network and run the loop inside the network, rather than calling the nn interface in a loop from outside; the two behave differently.

Roughly along these lines:

import mindspore.nn as nn

class Network(nn.Cell):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, pad_mode='pad', padding=1, has_bias=True)

    def construct(self, x, epochs):
        for i in range(epochs):
            x = self.conv2d(x)
        return x
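For reference, a minimal, self-contained way the sketch above could be exercised (purely illustrative, not from the original thread; equal in/out channel counts are used so that feeding the conv output back into the same conv stays shape-compatible):

import numpy as np
import mindspore as ms

# Hypothetical driver for the Network sketch above
net = Network(in_channels=3, out_channels=3, kernel_size=3)
x = ms.Tensor(np.ones((16, 3, 256, 256), dtype=np.float32) * 1.5)
out = net(x, 100)  # the whole loop now runs inside a single Cell call
print(out.shape)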

Hello, the MindSpore support staff have analyzed the problem and provided its cause. Since the answer has not been accepted for quite a while, the moderator will mark it as accepted and close the thread. If you have further questions, please open a new post. Thank you for your support~
