1 System Environment
Hardware environment (Ascend/GPU/CPU): GPU
MindSpore version: 2.0.0-alpha
Execution mode (PyNative/Graph): PyNative
Python version: 3.8
Operating system: Linux
2 Error Report
2.1 Problem Description
CSRTensor matrix multiplication fails with an error on GPU.
2.2 Error Message
RuntimeError: CUDA Error: cudaMemcpy failed. | Error Number: 700 an illegal memory access was encountered
2.3 Script
import mindspore as ms
a = ms.numpy.ones((7600, 7600), ms.float32).to_csr()
b = ms.numpy.ones((7600, 7600), ms.float32).to_csr()
a.mm(b)
3 Root Cause Analysis
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/python3.7/site-packages/mindspore/common/sparse_tensor.py", line 825, in mm
return tensor_operator_registry.get("csr_mm")(self, matrix)
File "/python3.7/site-packages/mindspore/ops/function/sparse_func.py", line 299, in csr_mm
b.values)
File "/python3.7/site-packages/mindspore/ops/primitive.py", line 316, in __call__
return _run_op(self, self.name, args)
File "/python3.7/site-packages/mindspore/common/api.py", line 97, in wrapper
results = fn(*arg, **kwargs)
File "/python3.7/site-packages/mindspore/ops/primitive.py", line 808, in _run_op
output = _pynative_executor.real_run_op(obj, op_name, args)
File "/python3.7/site-packages/mindspore/common/api.py", line 977, in real_run_op
return self._executor.real_run_op(*args)
RuntimeError: CUDA Error: SparseMatrixSparseMatMul cudaStreamSynchronized failed | Error Number: 700 an illegal memory access was encountered
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/sparse/sparse_matrix_sparse_matmul_gpu_kernel.cc:439 SyncData
a.mm(b) ultimately calls ops.csr_mm, so the problem is equivalent to:
import mindspore as ms
import mindspore.ops as ops
a = ms.numpy.ones((7600, 7600), ms.float32).to_csr()
b = ms.numpy.ones((7600, 7600), ms.float32).to_csr()
c = ops.csr_mm(a, b)
The code above reproduces the same problem.
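To make the data volume concrete: for an all-ones matrix, every element is non-zero, so a CSR representation stores all rows × cols values and elides nothing. The following is a minimal pure-Python sketch of the CSR layout (a hand-rolled illustration of the format, not MindSpore's `to_csr` implementation):

```python
def dense_to_csr(dense):
    """Convert a dense 2-D list into CSR components (indptr, indices, values)."""
    indptr, indices, values = [0], [], []
    for row in dense:
        for col, v in enumerate(row):
            if v != 0:
                indices.append(col)
                values.append(v)
        indptr.append(len(indices))
    return indptr, indices, values

# A 4x4 all-ones matrix: every element is non-zero, so nnz == rows * cols
# and CSR saves no storage at all.
n = 4
indptr, indices, values = dense_to_csr([[1.0] * n for _ in range(n)])
print(len(values))  # 16, i.e. n * n
```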
The failing code:
void SparseMatrixSparseMatMulGpuKernelMod::SyncData() {
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(cudaStreamSynchronize(stream),
"SparseMatrixSparseMatMul cudaStreamSynchronized failed");
mindspore/ccsrc/plugin/device/gpu/kernel/sparse/sparse_matrix_sparse_matmul_gpu_kernel.cc:439
The error is raised at cudaStreamSynchronize, which suggests something went wrong while synchronizing data to the device. In CUDA, an illegal memory access in an earlier kernel is typically only reported at the next synchronization point, and once it occurs, subsequent operations on the device are affected as well.
Sparse-sparse matrix multiplication uses the following infer function, defined in mindspore/core/ops/sparse_matrix_sparse_mat_mul.cc:
AbstractBasePtr SparseMatrixSparseMatMulInfer(const abstract::AnalysisEnginePtr &, const PrimitivePtr &primitive,
const std::vector<AbstractBasePtr> &input_args) {
MS_EXCEPTION_IF_NULL(primitive);
for (const auto &item : input_args) {
MS_EXCEPTION_IF_NULL(item);
}
auto prim_name = primitive->name();
const int64_t input_num = 10;
(void)CheckAndConvertUtils::CheckInteger("input number", SizeToLong(input_args.size()), kEqual, input_num, prim_name);
auto infer_type = SparseMatrixSparseMatMulInferType(primitive, input_args);
auto infer_shape = SparseMatrixSparseMatMulInferShape(primitive, input_args);
return abstract::MakeAbstract(infer_shape, infer_type);
}
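The check for input_num = 10 reflects that each sparse operand reaches the primitive decomposed into five component tensors (as seen in the traceback, where csr_mm passes fields such as b.values individually). The component names below follow the CSRSparseMatrix layout and are an assumption for illustration:

```python
# Each CSR sparse-matrix operand is assumed to be decomposed into five
# component tensors before reaching the primitive.
CSR_COMPONENTS = ["dense_shape", "batch_pointers", "row_pointers",
                  "col_indices", "values"]

# Two operands in decomposed form explain the infer check for exactly
# 10 inputs.
input_names = [f"x1_{c}" for c in CSR_COMPONENTS] + \
              [f"x2_{c}" for c in CSR_COMPONENTS]
print(len(input_names))  # 10
```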
Next, look at the SparseMatrixSparseMatMulInferShape function:
abstract::ShapePtr y_row_shape = nullptr;
abstract::ShapePtr y_col_shape = nullptr;
abstract::ShapePtr y_values_shape = nullptr;
ShapeVector col_shape = {abstract::Shape::kShapeDimAny};
ShapeVector values_shape = {abstract::Shape::kShapeDimAny};
ShapeVector infer_shape_min = {1};
ShapeVector infer_shape_max = {MAX_LENGTH};
y_col_shape = std::make_shared<abstract::Shape>(col_shape, infer_shape_min, infer_shape_max);
y_values_shape = std::make_shared<abstract::Shape>(values_shape, infer_shape_min, infer_shape_max);
with the related definition:
const int MAX_LENGTH = 100000;
The shape objects created here carry a maximum bound of MAX_LENGTH. When the matrix shape is too large, this bound is exceeded, and the subsequent CUDA computation accesses memory out of range and fails.
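A quick back-of-the-envelope check (assuming the quantity bounded by MAX_LENGTH corresponds to the number of stored CSR values) shows how far the repro case exceeds the limit:

```python
MAX_LENGTH = 100000  # the bound used in the infer-shape code above

# For an all-ones matrix converted to CSR, every element is stored,
# so the number of CSR values equals rows * cols.
nnz = 7600 * 7600
print(nnz)               # 57760000
print(nnz > MAX_LENGTH)  # True: more than 500x over the assumed maximum
```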
4 Solution
Since the error occurs when a large amount of data is synchronized, the simplest workaround is to reduce the data size:
import mindspore as ms
a = ms.numpy.ones((760, 760), ms.float32).to_csr()
b = ms.numpy.ones((760, 760), ms.float32).to_csr()
a.mm(b)
Alternatively, make b a dense (non-sparse) matrix. When every element is 1, the sparse matrix is equivalent to a dense one anyway:
import mindspore as ms
a = ms.numpy.ones((7600, 7600), ms.float32).to_csr()
b = ms.numpy.ones((7600, 7600), ms.float32)
a.mm(b)
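As a pure-Python sanity check on either workaround, independent of MindSpore: the product of two all-ones n×n matrices has every entry equal to n, which gives a closed form to verify the result against:

```python
def matmul_ones(n):
    """Multiply two all-ones n x n matrices using plain Python lists."""
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Each entry of the product is the sum of n ones, i.e. n.
c = matmul_ones(4)
print(c[0][0])  # 4.0
assert all(x == 4.0 for row in c for x in row)
```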