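1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.2
Execution mode (PyNative/Graph): any
2. Error information
2.1 Problem description
When a model is fine-tuned on Ascend, the training log reports overflow continuously: overflow cond is True on every logged step, and loss_scale falls from 16384.0 to 1.0 within 16 steps.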
dspore/train/model.py:653] In dataset_sink mode (dataset_size % sink_size) should equal to 0, it is suggested to pad/drop data or adjust size
- INFO - { Epoch:[ 1/ 2], step:[ 2/ 3915], loss: 1.653, per_step_time: 64622ms, Ir: 0.0, overflow cond: True, loss_scale: 16384.0
- INFO - 0.1%
- INFO - { Epoch:[ 1/ 2], step:[ 4/ 3915], loss: 1.510, per_step_time: 741ms, Ir: 0.0, overflow cond: True, loss_scale: 4096.0
- INFO - 0.1%
- INFO - { Epoch:[ 1/ 2], step:[ 6/ 3915], loss: 1.606, per_step_time: 753ms, Ir: 0.0, overflow cond: True, loss_scale: 1024.0
- INFO - 0.1%
- INFO - { Epoch:[ 1/ 2], step:[ 8/ 3915], loss: 1.309, per_step_time: 741ms, Ir: 0.0, overflow cond: True, loss_scale: 256.0
- INFO - 0.1%
- INFO - { Epoch:[ 1/ 2], step:[10/ 3915], loss: 1.233, per_step_time: 750ms, Ir: 0.0, overflow cond: True, loss_scale: 64.0
- INFO - 0.2%
- INFO - { Epoch:[ 1/ 2], step:[12/ 3915], loss: 2.089, per_step_time: 739ms, Ir: 0.0, overflow cond: True, loss_scale: 16.0
- INFO - 0.2%
- INFO - { Epoch:[ 1/ 2], step:[14/ 3915], loss: 1.446, per_step_time: 743ms, Ir: 0.0, overflow cond: True, loss_scale: 4.0
- INFO - 0.2%
- INFO - { Epoch:[ 1/ 2], step:[16/ 3915], loss: 1.639, per_step_time: 741ms, Ir: 0.0, overflow cond: True, loss_scale: 1.0
- INFO - 0.2%
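The loss_scale values in the log above are consistent with a dynamic loss scaler that halves the scale on every overflowing step. The sketch below reproduces that arithmetic. The initial scale of 65536 and the factor of 2 are assumptions chosen to match the logged values, not settings confirmed by the original post, and `simulate_loss_scale` is a hypothetical helper, not a MindSpore API.

```python
def simulate_loss_scale(init_scale, scale_factor, overflow_flags, min_scale=1.0):
    """Return the loss scale after each step.

    Only the decrease-on-overflow path of dynamic loss scaling is modelled:
    each overflowing step divides the scale by scale_factor, never going
    below min_scale. The growth path after clean steps is left out.
    """
    scale = float(init_scale)
    history = []
    for overflowed in overflow_flags:
        if overflowed:
            scale = max(scale / scale_factor, min_scale)
        history.append(scale)
    return history


# Assumed settings: initial scale 65536, factor 2, overflow on all 16 steps.
history = simulate_loss_scale(65536, 2, [True] * 16)

# Print every second step, as the training log does.
for step in range(2, 17, 2):
    print(f"step {step:2d}: loss_scale = {history[step - 1]}")
```

Under these assumptions the scale reaches 16384.0 at step 2 and 1.0 at step 16, the same sequence the log shows. A scale that collapses this quickly suggests that every step is being flagged as overflowed, not that the scale merely started too high.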
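3. Solution
# Set whether overflow in intermediate computations is checked
export MS_ASCEND_CHECK_OVERFLOW_MODE=INFNAN_MODE
Try this setting when training overflows persistently.
In INFNAN mode, overflow in intermediate computations is ignored; as long as the final result does not overflow, training continues.
In saturation mode, any overflow in an intermediate computation is reported as an overflow, and the algorithm skips the update for that step.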
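If the training job is launched from a Python entry script and not from a shell, the same variable can be set from Python. This is a minimal sketch: `set_overflow_mode` is a hypothetical helper, and it assumes the variable must be set before MindSpore is imported and initialised in order to take effect. The two accepted values, INFNAN_MODE and SATURATION_MODE, are the ones documented for this variable.

```python
import os

# Values accepted by MS_ASCEND_CHECK_OVERFLOW_MODE.
VALID_MODES = ("INFNAN_MODE", "SATURATION_MODE")


def set_overflow_mode(mode="INFNAN_MODE"):
    """Set the Ascend overflow-check mode for the current process.

    Hypothetical helper: it only writes the environment variable and
    rejects unknown values. It does not touch MindSpore itself.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown overflow mode: {mode!r}")
    os.environ["MS_ASCEND_CHECK_OVERFLOW_MODE"] = mode
    return mode


# Assumption: call this before `import mindspore` so the framework
# sees the value when it initialises.
set_overflow_mode("INFNAN_MODE")
print(os.environ["MS_ASCEND_CHECK_OVERFLOW_MODE"])
```

Exporting the variable in the launch shell, as shown above, remains the simpler option, because the value is then inherited by every worker process in a distributed run.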