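Persistent overflow when fine-tuning a large model with MindSpore, and how to resolve it

1. System environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.2
Execution mode (PyNative/Graph): any

2. Error message

2.1 Problem description

When the model is fine-tuned on Ascend, every logged step reports an overflow (overflow cond: True), and loss_scale falls from 16384.0 to 1.0 within the first 16 steps: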

dspore/train/model.py:653] In dataset_sink mode (dataset_size % sink_size) should equal to 0, it is suggested to pad/drop data or adjust size  
 - INFO - { Epoch:[ 1/ 2], step:[ 2/ 3915], loss: 1.653, per_step_time: 64622ms, lr: 0.0, overflow cond: True, loss_scale: 16384.0
 - INFO -     0.1%
 - INFO - { Epoch:[ 1/ 2], step:[ 4/ 3915], loss: 1.510, per_step_time: 741ms, lr: 0.0, overflow cond: True, loss_scale: 4096.0
 - INFO -     0.1%
 - INFO - { Epoch:[ 1/ 2], step:[ 6/ 3915], loss: 1.606, per_step_time: 753ms, lr: 0.0, overflow cond: True, loss_scale: 1024.0
 - INFO -     0.1%
 - INFO - { Epoch:[ 1/ 2], step:[ 8/ 3915], loss: 1.309, per_step_time: 741ms, lr: 0.0, overflow cond: True, loss_scale: 256.0
 - INFO -     0.1%
 - INFO - { Epoch:[ 1/ 2], step:[10/ 3915], loss: 1.233, per_step_time: 750ms, lr: 0.0, overflow cond: True, loss_scale: 64.0
 - INFO -     0.2%
 - INFO - { Epoch:[ 1/ 2], step:[12/ 3915], loss: 2.089, per_step_time: 739ms, lr: 0.0, overflow cond: True, loss_scale: 16.0
 - INFO -     0.2%
 - INFO - { Epoch:[ 1/ 2], step:[14/ 3915], loss: 1.446, per_step_time: 743ms, lr: 0.0, overflow cond: True, loss_scale: 4.0
 - INFO -     0.2%
 - INFO - { Epoch:[ 1/ 2], step:[16/ 3915], loss: 1.639, per_step_time: 741ms, lr: 0.0, overflow cond: True, loss_scale: 1.0
 - INFO -     0.2%
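
The loss_scale values in this log are consistent with a dynamic loss scale that is halved on every overflowing step. The sketch below is a pure-Python simulation, not MindSpore code. The initial scale of 65536, the factor of 2, and the floor of 1 are assumptions inferred from the logged numbers, and `simulate_loss_scale` is a hypothetical helper name.

```python
def simulate_loss_scale(init_scale=65536.0, factor=2.0, steps=16, min_scale=1.0):
    """Return the loss scale after each step, assuming every step overflows.

    init_scale, factor and min_scale are assumptions inferred from the log,
    not values read from the actual training configuration.
    """
    scale = init_scale
    history = []
    for _ in range(steps):
        # On overflow, the scale is divided by `factor` (never below min_scale).
        scale = max(scale / factor, min_scale)
        history.append(scale)
    return history


history = simulate_loss_scale()
# The log prints every second step (2, 4, ..., 16), so take every other entry.
logged = history[1::2]
print(logged)
```

Under these assumptions the simulated values at steps 2, 4, ..., 16 match the loss_scale column of the log, which suggests that every single step overflowed, not just the logged ones.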

3. Solution

# Controls whether overflow in intermediate computations is checked
export MS_ASCEND_CHECK_OVERFLOW_MODE=INFNAN_MODE
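
If the training job is launched from a Python script rather than a shell, the same variable can be set through `os.environ`. This is a minimal sketch: `set_overflow_mode` is a hypothetical helper, and `SATURATION_MODE` is assumed to be the name of the other accepted value (check the environment-variable documentation for your MindSpore version). Setting the variable before `mindspore` is imported is the conservative choice.

```python
import os

ENV_NAME = "MS_ASCEND_CHECK_OVERFLOW_MODE"
# Assumption: these are the two accepted values; verify for your version.
KNOWN_MODES = ("INFNAN_MODE", "SATURATION_MODE")


def set_overflow_mode(mode="INFNAN_MODE"):
    """Hypothetical helper: validate the mode and export it for this process."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown overflow mode: {mode!r}")
    os.environ[ENV_NAME] = mode
    return mode


set_overflow_mode("INFNAN_MODE")
# import mindspore as ms  # import only after the variable has been set
print(os.environ[ENV_NAME])
```

Child processes started from this script inherit the variable, so it also reaches workers launched with `subprocess`.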

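Try this setting when training overflows continuously.
In INFNAN mode, overflow in intermediate computations is ignored; as long as the final result does not overflow, training continues.
In saturation mode, any overflow in an intermediate computation is reported as an overflow, and the algorithm skips the update for that step.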
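
The difference between the two modes can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not of the Ascend hardware: the computation `1 / (1 + x*x)`, the float16 limit used as the overflow threshold, and the function names are all assumptions chosen for illustration.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value, used as the toy threshold


def run_infnan(x):
    """Toy INF/NAN-style run: out-of-range intermediates become inf,
    and only the final result is checked for inf/nan."""
    inter = x * x
    if abs(inter) > FP16_MAX:
        inter = math.copysign(math.inf, inter)
    result = 1.0 / (1.0 + inter)
    overflow = not math.isfinite(result)
    return result, overflow


def run_saturation(x):
    """Toy saturation-style run: out-of-range intermediates are clamped,
    and a sticky flag records that an overflow happened anywhere."""
    flag = False
    inter = x * x
    if abs(inter) > FP16_MAX:
        inter = math.copysign(FP16_MAX, inter)
        flag = True
    result = 1.0 / (1.0 + inter)
    return result, flag


print(run_infnan(300.0))      # intermediate overflows, final result is finite
print(run_saturation(300.0))  # same input, but the sticky flag is raised
```

For `x = 300.0` the intermediate `x * x` exceeds the threshold in both runs. The INF/NAN-style run still ends with a finite result and reports no overflow, while the saturation-style run reports an overflow because of the intermediate value alone.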