A must-have resource! The MindSpore large model error-resolution map is here (continuously updated)

The MindSpore Transformers suite (https://www.mindspore.cn/mindformers/docs/zh-CN/master/index.html) aims to provide a full-flow development suite for large model training, fine-tuning, evaluation, inference, and deployment, built on MindSpore's native parallelism and componentized design. Drawing on experience accumulated while working with MindSpore large models, this post collects common cases, each pairing an error message with its solution (currently covering only migration issues; continuously updated), so you can look up the fix that matches your error message.
Category 1: Environment Issue Cases

1.1 MindSpore 2.2.10 reports AttributeError: module 'mindspore.nn' has no attribute 'FlashAttention' when using the Flash Attention feature (see the environment-check sketch after this list)

1.2 MindSpore model saving prompts: need to check whether the batch size and so on in the 'net' and 'parameter_dict' are same.

1.3 MindSpore reports IndexError: list index out of range when the profiler feature is enabled

1.4 A MindSpore multi-node Profiler run reports ValueError: not enough values to unpack (expected 4, got 0)

1.5 The summary feature does not take effect after SummaryMonitor is configured under callbacks in the yaml file

1.6 The MindSpore weight-saving feature fails to save the updated weights

1.7 Enabling MS_DISABLE_REF_MODE in MindSpore causes the error The device address type is wrong: type name in address: CPU, type name in context: Ascend

1.8 MindSpore and TBE version mismatch and its solution

1.9 Building MindSpore on Ascend reports has no member named 'update_output_desc_dpse'; did you mean 'update_output_desc_dq'?

1.10 With profiling enabled, MindSpore under a parallel strategy reports ValueError: When distributed loads are sliced weights, sink mode must be set True.

1.11 Malloc device memory failed at model startup

1.12 The MindSpore Pangu model reports Failed to allocate memory. Possible Cause: Available memory is insufficient.

1.13 Request timeout during separated deployment on Ascend 910
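
Many of the environment cases above, such as 1.1 and 1.8, come down to a version or installation mismatch. Below is a minimal sanity-check sketch using only standard MindSpore APIs (mindspore.run_check and mindspore.__version__); the FlashAttention guard mirrors case 1.1 and is an illustrative assumption about where the feature lives in your version.

```python
import mindspore as ms

# Print the installed version first; many "has no attribute" errors are
# simply APIs that do not exist in the installed release (see case 1.1).
print("MindSpore version:", ms.__version__)

# run_check() compiles and runs a tiny graph to verify that MindSpore,
# the driver, and the backend stack (e.g. CANN/TBE, case 1.8) cooperate.
ms.run_check()

# Guard version-dependent features instead of assuming they exist:
# mindspore.nn had no FlashAttention attribute in 2.2.10.
if not hasattr(ms.nn, "FlashAttention"):
    print("mindspore.nn.FlashAttention is unavailable in this version; "
          "upgrade MindSpore or use the implementation shipped with MindFormers.")
```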

Category 2: Configuration Issue Cases

2.1 MindSpore reports ImportError: cannot import name 'build_dataset_loader' from 'mindformers.dataset.dataloader'

2.2 The MindSpore large model reports xxx is not a supported default model or a valid path to checkpoint.

2.3 llama2 model conversion reports ImportError: cannot import name 'swap_cache' from 'mindspore._c_expression'

2.4 MindSpore reports BrokenPipeError: [Errno 32] Broken pipe

2.5 CodeLlama on Ascend 910 reports Session options is not equal in diff config infos when models' weights are shared, last session options

2.6 Collective communication library initialization init() errors out, demanding that multi-card training be launched with mpirun (see the sketch after this list)

2.7 Overflow issue in INFNAN mode
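
Case 2.6 (and 4.22 later) occurs because mindspore.communication.init() must run inside a job started by a distributed launcher. Below is a minimal sketch of the expected flow, assuming an Ascend backend; the context settings are examples, and the job must be started with mpirun (OpenMPI) or MindSpore's msrun launcher rather than plain python.

```python
import mindspore as ms
from mindspore.communication import init, get_rank, get_group_size

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")

# init() sets up HCCL on Ascend. Calling it from a process that was not
# spawned by a launcher (mpirun/msrun) raises the RuntimeError in case 2.6.
init()

print(f"rank {get_rank()} of {get_group_size()} initialized")
```

For example, launching with `mpirun -n 8 python train.py` (OpenMPI installed) satisfies the launcher requirement.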

Category 3: Data Processing Cases

3.1 MindRecord format conversion reports ValueError: For 'Mul', x.shape and y.shape need to broadcast. (see the MindRecord writing sketch after this list)

3.2 ImportError: cannot import name 'tik' from 'te' while converting data into MindRecord

3.3 MindSpore data parallelism reports Call GE RunGraphWithStreamAsync Failed, EL0004: Failed to allocate memory.

3.4 MindSpore training reports TypeError: Invalid Kernel Build Info! Kernel type: AICPU_KERNEL, node: Default/Concat-op1

3.5 The MindSpore model forward pass reports Sync stream failed: Ascend_0, with plog showing an out-of-bounds Gather operator

3.6 model.train reports Exception in training: The input value must be int and must > 0, but got '0' with type 'int'.

3.7 An abnormal dataset causes compilation (model.build) or training (model.train) to hang

3.8 baichuan2-13b keeps overflowing on Ascend 910

3.9 MTP dataset distributed read-write deadlock: Failed to execute the sql [SELECT NAME from SHARD_NAME;] while verifying meta file, database is locked

3.10 The MTP task hangs and the platform reports ['ROOT_CLUSTER'] job failed.

3.11 How to collect performance data for large models

3.12 Parsing model performance data

3.13 Fine-grained comparison of dataset processing results

3.14 Accelerating training by optimizing the data
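
Several cases above (3.1, 3.2, and the meta-file error in 3.9) arise while producing MindRecord files. For reference, here is a minimal, self-contained sketch of the standard mindspore.mindrecord.FileWriter flow; the schema, file name, and sample data are illustrative assumptions.

```python
import numpy as np
from mindspore.mindrecord import FileWriter

# Illustrative schema for tokenized language-model samples.
schema = {"input_ids": {"type": "int32", "shape": [-1]},
          "labels": {"type": "int32", "shape": [-1]}}

writer = FileWriter(file_name="sample.mindrecord", shard_num=1, overwrite=True)
writer.add_schema(schema, "toy LM samples")

samples = [{"input_ids": np.array([1, 2, 3], dtype=np.int32),
            "labels": np.array([2, 3, 4], dtype=np.int32)}]
writer.write_raw_data(samples)

# commit() finalizes the .mindrecord file and its .db meta file; failures
# at this step surface as "Failed to write mindrecord meta files" errors.
writer.commit()
```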

Category 4: Parallelism Issue Cases

4.1 MindSpore distributed model parallelism reports: operator Mul init failed or CheckStrategy failed.

4.2 MindSpore model parallelism reports ValueError: array split does not result in an equal division

4.3 MindSpore large model training reports: BrokenPipeError: [Errno 32] Broken pipe, EOFError

4.4 Which yaml settings are required for MindSpore large model parallelism

4.5 Model parallelism reports out-of-memory

4.6 Some cards log a loss of 0 under MindSpore pipeline parallelism

4.7 A MindSpore 8-card run hits a Socket times out issue

4.8 MindSpore pipeline-parallel training reports RuntimeError: Stage 0 should has at least 1 parameter. but got none.

4.9 An 8:1:1 parallel strategy reports RuntimeError: May you need to check if the batch size etc. in your 'net' and 'parameter dict' are same.

4.10 A 1:1:8 model parallel strategy reports RuntimeError: Stage num is 8 is not equal to stage used: 5

4.11 Single-node 4-card distributed inference reports RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.

4.12 A MindSpore distributed run reports TypeError: The parameters number of the function is 636, but the number of provided arguments is 635.

4.13 A MindSpore distributed 8-node run reports Call GE RunGraphWithStreamAsync Failed, ret is: 4294967295

4.14 A MindFormers single-node 8-card invocation reports No parameter is entered. Notice that the program will run on default 8 cards.

4.15 MindSpore distributed parallelism reports The strategy is XXX, shape XXX cannot be divisible by strategy value XXX (see the divisibility sketch after this list)

4.16 MTP multi-process mindrecord generation reports RuntimeError: Unexpected error. [Internal ERROR] Failed to write mindrecord meta files.

4.17 Pipeline parallelism reports Reshape op can't be a border.

4.18 Device memory usage grows after increasing the data-parallel degree

4.19 Converting MindSpore distributed ckpt weights A into distributed weights B for another strategy

4.20 Very long compilation time when pipeline parallelism runs without Cell sharing enabled

4.21 Multi-node training reports: import torch_npu._C ImportError: libascend_hal.so: cannot open shared object file: No such file or directory

4.22 Running in docker reports: RuntimeError: Maybe you are trying to call 'mindspore.communication.init()' without using 'mpirun'
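
The error in case 4.15 encodes a hard constraint: every dimension of a tensor must divide evenly by the corresponding slice count in its sharding strategy. Here is a framework-free helper expressing the same check, with illustrative shapes:

```python
def check_strategy(shape, strategy):
    """Raise if `shape` cannot be partitioned by `strategy` (cf. case 4.15)."""
    if len(shape) != len(strategy):
        raise ValueError(f"strategy {strategy} rank != tensor rank of {shape}")
    for dim, slices in zip(shape, strategy):
        if dim % slices != 0:
            raise ValueError(f"The strategy is {tuple(strategy)}, shape "
                             f"{tuple(shape)} cannot be divisible by "
                             f"strategy value {slices}")

check_strategy((4096, 5120), (1, 8))    # fine: 5120 % 8 == 0
# check_strategy((4096, 5000), (1, 8))  # would raise: 5000 % 8 != 0
```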

Category 5: Training and Inference Issue Cases

5.1 MindSpore reports: TypeError: Multiply values for specific argument: query_embeds

5.2 A misdirected tokenizer path reports TypeError: GPT2Tokenizer __init__() missing 2 required positional arguments: 'vocab_file' and 'merges_file'

5.3 Missing tokenizer files report TypeError: __init__() missing 2 required positional arguments: 'vocab_file' and 'merge_file'

5.4 Model inference reports RuntimeError: A model class needs to define a `prepare_inputs_for_generation` method in order to use `.generate()`

5.5 MindSpore model inference reports: memory isn't enough and alloc failed, kernel name: kernel_graph_@HostDSActor, alloc size: 8192B

5.6 MindSpore model inference reports TypeError: cell_reuse() takes 0 positional arguments but 1 was given

5.7 Running migrated wizardcoder code reports broken pipe

5.8 MindSpore Lite model loading reports RuntimeError: build from file failed! Error is Common error code.

5.9 Precision-alignment settings for MindSpore autoregressive inference

5.10 After pipeline parallelism or gradient accumulation is enabled, the MindSpore large model's loss neither overflows nor converges

5.11 Evaluating with the ADGEN dataset on Ascend reports not support in PyNative RunOp!

5.12 qwen1.5_1.8B inference produces jumbled answers: problem and solution

5.13 LLaMA2-7B inference on a single Ascend 910 card is slow

5.14 Overflow during MindSpore large model fine-tuning and its solution

5.15 Slow MindSpore large model online inference and the solution

5.16 Llama inference reports the parameter validation error TypeError: The input value must be int, but got 'NoneType'.

5.17 Malloc device memory failed at model startup

5.18 The MindSpore model reports Reason: Memory resources are exhausted.

5.19 MindSpore 2.2.10 GE graph mode reports: Current execute mode is KernelByKernel, the processes must be launched with OpenMPI or …

5.20 baichuan2-13b operator overflow and runaway loss: problem and diagnosis

5.21 MindSpore training aborts abnormally: Try to send request before Open(), Try to get response before Open(), Response is empty

5.22 An Ascend 910 training script fails right at launch: RuntimeError: Initialize GE failed!

5.23 Analyzing operator overflow on Ascend 910

5.24 Using the ops.nonzero operator reports TypeError

5.25 An evaluation timeout during training interrupts the training process

5.26 CodeLlama inference on Ascend 910 reports get fail deviceLogicId[0]

5.27 Exporting a mindir model for CodeLlama on Ascend 910 reports rankTablePath is invalid

5.28 An unexpectedly modified weight file causes Failed to read the checkpoint file during weight loading

5.29 MindFormers model startup is killed because of host-side OOM

5.30 FlashAttention alibi adaptation issue on Ascend 910

5.31 For llama3.1-8b LoRA fine-tuning, leaving weight conversion off causes a dimension mismatch, while turning it on reports that rank1's ckpt cannot be found even though the strategy directory is complete

5.32 MindSpore + MindFormers training plog reports halMemAlloc failed, drvRetCode=6

5.33 MindSpore offline weight conversion interface and conversion process (see the conversion sketch after this list)

5.34 Converting between MindSpore ckpt-format full weights and distributed weights

5.35 MindSpore + MindFormers r1.2.0 fine-tuning of qwen1.5 reports an error

5.36 The operator GatherV2_xxx_high_precision_xx reports an error in the MindFormers training plog

5.37 Merging weights after LoRA fine-tuning with MindFormers

5.38 Diagnosing scaling-linearity issues for pangu-100b on a 2k cluster

5.39 Summary of Mixtral 8*7B large model accuracy issues

5.40 Tuning large model memory usage

5.41 Comparing large model network algorithm parameters and outputs

5.42 Common if-control-flow issues during compilation when jit compilation acceleration is used

5.43 How to avoid repeated recompilation when jit compilation acceleration is enabled

5.44 Compilation reports ValueError: Please set a unique name for the parameter.

5.45 Memory optimization for large model dynamic-graph training

5.46 Accuracy convergence analysis and tuning for large models

5.47 Dump tool in practice: an operator execution error caused by out-of-range input values

5.48 Dump tool in practice: overflow during network training

5.49 Lessons learned on long-run performance jitter or degradation in model training

5.50 msprobe tool in practice: overflow during network training

5.51 Common questions about the msprobe accuracy-diagnosis tool

5.52 Summary of performance optimization for model compilation

5.53 Performance tuning guide for large model dynamic-graph training

5.54 Large model iteration tail and other performance optimizations

5.55 Performance optimization of large model forward and backward computation

5.56 Performance optimization of large model inter-iteration gaps
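
Cases 5.33 and 5.34 (and 4.19 earlier) all revolve around offline weight conversion between sharding strategies. Below is a sketch of that flow using the public mindspore.transform_checkpoints API; every path is a placeholder, and the source directory is expected to contain rank_0, rank_1, ... subdirectories.

```python
import mindspore as ms

ms.transform_checkpoints(
    src_checkpoints_dir="./src_ckpt",         # weights saved under strategy A
    dst_checkpoints_dir="./dst_ckpt",         # output directory for strategy B
    ckpt_prefix="transformed",                # prefix of the generated files
    src_strategy_file="./src_strategy.ckpt",  # None = source is a full weight
    dst_strategy_file="./dst_strategy.ckpt",  # None = merge into a full weight
)
```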

Category 6: Model Sharding Issue Cases

6.1 After MindSpore automatic weight sharding, loading reports ValueError: Failed to read the checkpoint. please check the correct of the file. (see the sanity-check sketch after this list)

6.2 MindSpore 2-node automatic weight sharding reports NotADirectoryError: ./output/transformed_checkpoint/rank_15 is not a real directory

6.3 Errors when using the hccl_tools utility under MindSpore parallel strategies

6.4 The MindSpore large model reports: Inner Error! EZ9999 [InferShape] The k-axis of a(131072) and b(16384) tensors must be the same.

6.5 Transformers reports google.protobuf.message.DecodeError: Wrong wire type in tag.
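
For case 6.1, it is worth verifying that the checkpoint file itself is readable before blaming the sharding step. Here is a small sanity check built on mindspore.load_checkpoint; the path is a placeholder.

```python
import mindspore as ms

try:
    params = ms.load_checkpoint("./output/checkpoint/rank_0/net.ckpt")
except ValueError as err:
    # Typically a truncated, corrupted, or non-checkpoint file (cf. 5.28).
    print("checkpoint unreadable:", err)
else:
    # Spot-check a few parameter names and shapes.
    for name, param in list(params.items())[:5]:
        print(name, tuple(param.shape))
```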

Category 7: Other Issue Cases

7.1 Incomplete error logs

7.2 The log shows the pretrained model was not loaded: model built, but weights is unloaded, since the config has no attribute or is None.

7.3 Working directory issue: 'from mindformers import Trainer' reports ModuleNotFoundError: No module named 'mindformers' (see the path-fix sketch after this list)

7.4 Generating mindrecord on NFS reports Failed to write mindrecord meta files

7.5 Pangu-Zhizi 38B on Ascend 910 cannot produce a fixed output in greedy mode

7.6 A moxing copy timeout on Ascend causes a Notify timeout

7.7 MTP on Ascend 910 reports KeyError: 'group_list' after switching to a different device model
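
Case 7.3 is usually a working-directory problem: the script runs outside the MindFormers repository root, so Python cannot resolve the package. A workaround sketch follows; the checkout path is hypothetical, and exporting PYTHONPATH achieves the same effect.

```python
import os
import sys

MINDFORMERS_ROOT = "/path/to/mindformers"  # hypothetical checkout location
sys.path.insert(0, os.path.abspath(MINDFORMERS_ROOT))

from mindformers import Trainer  # resolvable once the repo root is on sys.path
print(Trainer)
```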

Three offline versions of the error-resolution map are also thoughtfully provided here; you are welcome to download them!

· docx version download link

· pdf version download link

· md version download link