MindSpore Forum

In-depth guide: the MindSpore large-model error resolution map is here (continuously updated)


Skyti (Skyti), October 24, 2025, 03:41

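MindSpore Transformers (https://www.mindspore.cn/mindformers/docs/zh-CN/master/index.html) is a full-workflow development suite for large-model training, fine-tuning, evaluation, inference, and deployment, built on MindSpore's built-in parallel technology and component-based design. Drawing on experience accumulated while using MindSpore large models, this post summarizes common cases together with their error messages and solutions (it currently covers only problems met during migration and is updated continuously). Look up the error message you see to find the matching solution.

Error analysis reference: Error Analysis | MindSpore master Tutorials | MindSpore Community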

Category 1: Environment Issues

1.1 MindSpore 2.2.10: using the Flash Attention feature raises AttributeError: module 'mindspore.nn' has no attribute 'FlashAttention'

1.2 MindSpore model saving warns: need to check whether your batch size and so on in the 'net' and 'parameter dict' are same.

1.3 Enabling the MindSpore profiler raises IndexError: list index out of range

1.4 Running the MindSpore Profiler on multiple machines raises ValueError: not enough values to unpack (expected 4, got 0)

1.5 The summary feature does not take effect after SummaryMonitor is configured under callbacks in the MindSpore yaml file

1.6 MindSpore model weight saving does not save the updated weights

1.7 Enabling MS_DISABLE_REF_MODE in MindSpore causes the error The device address type is wrong: type name in address:CPU, type name in context:Ascend

1.8 MindSpore and tbe version mismatch and how to fix it

1.9 Building MindSpore on Ascend fails with has no member named 'update_output_desc_dpse'; did you mean 'update_output_desc_dq'?

1.10 With profile enabled in MindSpore, using a parallel strategy raises ValueError: When distributed loads are sliced weights, sink mode must be set True.

1.11 Malloc device memory failed at model startup

1.12 MindSpore Pangu model error Failed to allocate memory. Possible Cause: Available memory is insufficient.

1.13 Requests time out in a separated deployment on Ascend 910

Category 2: Configuration Issues

2.1 MindSpore error ImportError: cannot import name 'build_dataset_loader' from 'mindformers.dataset.dataloader'

2.2 MindSpore large-model error xxx is not a supported default model or a valid path to checkpoint.

2.3 llama2 model conversion error ImportError: cannot import name 'swap_cache' from 'mindspore._c_expression'

2.4 MindSpore error BrokenPipeError: [Errno 32] Broken pipe

2.5 CodeLlama on Ascend 910 error Session options is not equal in diff config infos when models' weights are shared, last session options

2.6 Collective communication library init() fails and requires multi-card training to be launched with mpirun

2.7 Overflow in INFNAN mode

Category 3: Data Processing

3.1 MindRecord format conversion error ValueError: For 'Mul', x.shape and y.shape need to broadcast.

3.2 ImportError: cannot import name 'tik' from 'te' when converting data to MindRecord

3.3 MindSpore data parallelism error Call GE RunGraphWithStreamAsync Failed, EL0004: Failed to allocate memory.

3.4 MindSpore training error TypeError: Invalid Kernel Build Info! Kernel type: AICPU_KERNEL, node: Default/Concat-op1

3.5 MindSpore model forward pass error Sync stream failed: Ascend_0; plog shows the Gather operator going out of bounds

3.6 model.train error Exception in training: The input value must be int and must > 0, but got '0' with type 'int'.

3.7 An abnormal dataset causes compilation (model.build) or training (model.train) to hang

3.8 baichuan2-13b overflows continuously on Ascend 910

3.9 MTP dataset distributed read/write deadlock: Failed to execute the sql [SELECT NAME from SHARD_NAME;] while verifying meta file, database is locked

3.10 MTP job hangs; the platform reports ['ROOT_CLUSTER'] job failed.

3.11 How to collect performance data for large models

3.12 Parsing model performance data

3.13 Fine-grained comparison of dataset processing results

3.14 Speeding up training by optimizing the data

Category 4: Parallelism Issues

4.1 MindSpore distributed model parallelism error: operator Mul init failed or CheckStrategy failed.

4.2 MindSpore model parallelism error ValueError: array split does not result in an equal division

4.3 MindSpore large-model training error: BrokenPipeError: [Errno 32] Broken pipe, EOFError

4.4 Which settings MindSpore large-model parallelism needs in the corresponding yaml

4.5 Model parallelism reports out of memory

4.6 MindSpore pipeline parallelism: loss is 0 in the logs of some cards

4.7 MindSpore Socket times out on 8 cards

4.8 MindSpore pipeline parallel training error RuntimeError: Stage 0 should has at least 1 parameter. but got none.

4.9 With a parallel strategy of 8:1:1, RuntimeError: May you need to check if the batch size etc. in your 'net' and 'parameter dict' are same.

4.10 With a model parallel strategy of 1:1:8, RuntimeError: Stage num is 8 is not equal to stage used: 5

4.11 Single-machine 4-card distributed inference fails with RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.

4.12 MindSpore distributed run error TypeError: The parameters number of the function is 636, but the number of provided arguments is 635.

4.13 MindSpore 8-node distributed run error Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295

4.14 MindFormers single-machine 8-card run reports No parameter is entered. Notice that the program will run on default 8 cards.

4.15 MindSpore distributed parallelism error The strategy is XXX, shape XXX cannot be divisible by strategy value XXX

4.16 MTP multi-process mindrecord generation error RuntimeError: Unexpected error. [Internal ERROR] Failed to write mindrecord meta files.

4.17 Pipeline parallelism error Reshape op can't be a border.

4.18 Device memory usage grows after the data-parallel degree is increased

4.19 Converting MindSpore distributed ckpt weights A into distributed weights B for a different strategy

4.20 Long compile time in pipeline parallelism because Cell sharing is not enabled

4.21 Multi-machine training error: import torch_npu._C ImportError: libascend_hal.so: cannot open shared object file: No such file or directory

4.22 Error when running in docker: RuntimeError: Maybe you are trying to call 'mindspore.communication.init()' without using 'mpirun'
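
The message in case 4.2 has the same text as the ValueError that NumPy's `np.split` raises when an axis length is not divisible by the requested number of sections. The sketch below only reproduces that NumPy behavior. The array shape, the shard counts, and the helper name `try_split` are hypothetical, and the linked case may have a different root cause.

```python
import numpy as np

# Hypothetical weight with 10 rows; real model weights will differ.
weight = np.zeros((10, 4))

def try_split(array, num_shards):
    """Split along axis 0; return the shards, or the error text on failure."""
    try:
        return np.split(array, num_shards, axis=0)
    except ValueError as err:
        return str(err)

ok = try_split(weight, 2)   # 10 rows divide evenly into 2 shards of 5 rows
bad = try_split(weight, 3)  # 10 rows cannot be divided evenly into 3 shards

print(len(ok), ok[0].shape)
print(bad)
```

If this is what is happening, checking that the split dimension is a multiple of the shard count before launching the job is usually cheaper than waiting for the failure at load time.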

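Category 5: Training and Inference Issues

5.1 MindSpore error: TypeError: Multiply values for specific argument: query_embeds

5.2 Tokenizer pointing to the wrong target raises TypeError: GPT2Tokenizer: __init__() missing 2 required positional arguments: 'vocab_file' and 'merges_file'

5.3 Missing Tokenizer file raises TypeError: __init__() missing 2 required positional arguments: 'vocab_file' and 'merge_file'

5.4 Model inference error RuntimeError: A model class needs to define a `prepare_inputs_for_generation` method in order to use `.generate()`

5.5 MindSpore model inference error: memory isn't enough and alloc failed, kernel name: kernel_graph_@ HostDSActor, alloc size: 8192B

5.6 MindSpore model inference error TypeError: cell_reuse() takes 0 positional arguments but 1 was given

5.7 broken pipe when running the migrated wizardcoder code

5.8 MindSpore Lite model loading error RuntimeError: build from file failed! Error is Common error code.

5.9 Precision alignment settings for autoregressive inference in MindSpore

5.10 After pp parallelism or gradient accumulation is enabled for a MindSpore large model, loss neither overflows nor converges

5.11 Evaluating with the ADGEN dataset on Ascend reports not support in PyNative RunOp!

5.12 qwen1.5_1.8B inference produces garbled answers: problem and fix

5.13 Slow LLaMA2-7B inference on a single Ascend 910 card

5.14 Overflow during MindSpore large-model fine-tuning and how to fix it

5.15 Slow online inference with MindSpore large models and solutions

5.16 Llama inference parameter validation error TypeError: The input value must be int, but got 'NoneType'.

5.17 Malloc device memory failed at model startup

5.18 MindSpore model error Reason: Memory resources are exhausted.

5.19 MindSpore 2.2.10 GE graph mode error: Current execute mode is KernelByKernel, the processes must be launched with OpenMPI or …

5.20 baichuan2-13b operator overflow and loss divergence: problem and diagnosis

5.21 MindSpore training aborts abnormally: Try to send request before Open(), Try to get response before Open(), Response is empty

5.22 Ascend 910 training script fails right after starting: RuntimeError: Initialize GE failed!

5.23 Analysis of operator overflow on Ascend 910

5.24 TypeError when using the ops.nonzero operator

5.25 An evaluation timeout during training interrupts the training run

5.26 CodeLlama inference on Ascend 910 reports get fail deviceLogicId[0]

5.27 Exporting a CodeLlama mindir model on Ascend 910 reports rankTablePath is invalid

5.28 An abnormally modified weight file causes Failed to read the checkpoint file when loading weights

5.29 Mindformers job is killed at model startup because of host-side OOM

5.30 Adapting FlashAttention to alibi on Ascend 910

5.31 llama3.1-8b LoRA fine-tuning: without weight conversion the dimensions do not match; with it enabled, the rank1 ckpt is reported missing even though the strategy directory is complete

5.32 MindSpore + MindFormers training: plog reports halMemAlloc failed, drvRetCode=6

5.33 The MindSpore offline weight conversion interface and the conversion process

5.34 Converting between complete and distributed weights in MindSpore ckpt format

5.35 MindSpore + MindFormers r1.2.0: error when fine-tuning qwen1.5

5.36 MindFormers training: operator GatherV2_xxx_high_precision_xx error in plog

5.37 Merging weights after LoRA fine-tuning with mindformers

5.38 pangu-100b: diagnosing scaling linearity on a 2k cluster

5.39 Summary of Mixtral 8x7B large-model precision issues

5.40 Tuning large-model memory usage

5.41 Comparing large-model network algorithm parameters and outputs

5.42 Common if control-flow problems in the compilation flow when using jit compilation acceleration

5.43 How to avoid repeated recompilation when jit compilation acceleration is enabled

5.44 Compile-time error ValueError: Please set a unique name for the parameter.

5.45 Memory optimization for large-model dynamic-graph training

5.46 Large-model precision convergence analysis and tuning

5.47 Using the Dump tool: operator execution error caused by out-of-range input values

5.48 Using the Dump tool: overflow in network training

5.49 Lessons learned on performance jitter or degradation in long-running stable training

5.50 Using the msprobe tool: overflow in network training

5.51 Common issues with the msprobe precision diagnosis tool

5.52 Summary of model compilation performance optimization

5.53 Performance tuning guide for large-model dynamic-graph training

5.54 Large-model iteration tail latency and other performance optimizations

5.55 Performance optimization of large-model forward and backward computation

5.56 Performance optimization of the gaps between large-model iterations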

Category 6: Model Sharding Issues

6.1 After automatic weight sharding, MindSpore reports ValueError: Failed to read the checkpoint. please check the correct of the file.

6.2 MindSpore 2-machine automatic weight sharding error NotADirectoryError: ./output/transformed_checkpoint/rank_15 is not a real directory

6.3 hccl_tools reports an error under a MindSpore parallel strategy

6.4 MindSpore large-model error: Inner Error! EZ9999 [InferShape] The k-axis of a(131072) and b(16384) tensors must be the same.

6.5 Transformers error google.protobuf.message.DecodeError: Wrong wire type in tag.

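Category 7: Other Issues

7.1 Incomplete error logs

7.2 The log shows that the pretrained model was not loaded: model built, but weights is unloaded, since the config has no attribute or is None.

7.3 Working-directory issue: 'from mindformers import Trainer' raises ModuleNotFoundError: No module named 'mindformers'

7.4 Generating mindrecord on NFS reports Failed to write mindrecord meta files

7.5 Pangu-Zhizi 38B (盘古-智子38B) on Ascend 910: output cannot be made deterministic in greedy mode

7.6 A moxing copy timeout on Ascend causes a Notify timeout

7.7 MTP Ascend 910: switching between device models raises KeyError: 'group_list'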

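Three offline versions of the error resolution map are also provided here. Feel free to download them!

· docx version download link

· pdf version download link

· md version download link

Comment-section hosts, are you ready?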
