"Malloc device memory failed" reported at model startup

1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: any
Execution mode (PyNative/Graph): any

2 Problem Description

The program reports "Malloc device memory failed" right after it starts.

2023-08-24 09:03:49,594 - mindformers - INFO - Network Parameters: 13264 M.
Traceback (most recent call last):
    File "/home/ma-user/work/mindformers/research/baichuan/run_baichuan_13b.py", line 115, in <module>
        main(task=args.task,
    File "/home/ma-user/work/mindformers/research/baichuan/run_baichuan_13b.py", line 82, in main
        result = trainer.predict(input_data=predict_data,
    File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindformers/trainer/trainer.py", line 678, in predict
        output_result = self.trainer.predict(
    File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 309, in predict
        return self.predict_process(config=config,
    File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindformers/trainer/base_trainer.py", line 744, in predict_process
        transform_and_load_checkpoint(config, model, network, None, do_predict=True)
    File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindformers/trainer/utils.py", line 294, in transform_and_load_checkpoint
        raise FileNotFoundError(f"The load_checkpoint must be correct,
FileNotFoundError: The load_checkpoint must be correct, but get /home/ma-user/work/mindformers/research/ckpt_dir/transform_13b
[ERROR] PIPELINE(38220,ffff9632db30,python):2023-08-24-09:03:49.648.021 [mindspore/ccsrc/pipeline/jit/pipeline.cc:2073] ClearResAtexit] Check exception before process exit: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.
Framework Error Message:
Malloc device memory failed, free memory size is less than half of total memory size.Device 0 Device HBM total size:34359738368 Device HBM free size:2094153728 maybe other processes occupying this card, check as: ps -ef|grep python
C++ Call Stack: (For framework developers)
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:413 Init
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:63 Initialize
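To make the numbers in the framework message easier to read, the short sketch below parses the two HBM sizes from the message text and compares the free size against half of the total. The helper name parse_hbm is hypothetical and only for illustration; the message string is copied from the log above.

```python
import re

# Text copied from the "Framework Error Message" in the log above.
MESSAGE = (
    "Malloc device memory failed, free memory size is less than half of total "
    "memory size.Device 0 Device HBM total size:34359738368 "
    "Device HBM free size:2094153728"
)


def parse_hbm(message):
    """Return (total_bytes, free_bytes) parsed from the error message.

    Hypothetical helper for illustration; not part of MindSpore.
    """
    total = int(re.search(r"HBM total size:(\d+)", message).group(1))
    free = int(re.search(r"HBM free size:(\d+)", message).group(1))
    return total, free


total, free = parse_hbm(MESSAGE)
GIB = 1024 ** 3

print(f"total: {total / GIB:.2f} GiB")
print(f"free:  {free / GIB:.2f} GiB ({free / total:.1%} of total)")
# The runtime refuses to initialize when this condition is true.
print("free < total / 2:", free < total / 2)
```

For this log, the card has 32 GiB of HBM in total and a little under 2 GiB free, so the "less than half" condition is clearly met.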

3 Solution

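This error occurs because more than half of the device memory on the card used for this run is already occupied. Run npu-smi info to find the PID of the process occupying the card, then run ps -aux | grep <pid_num> to see the details of that process. If you want to terminate the process, run kill -9 <pid_num>. If the process cannot be terminated, rerun the program on a different, idle card by specifying another device_id.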
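As a minimal sketch of the last option, the snippet below picks the first device id that is not in a list of busy cards and exposes it through the DEVICE_ID environment variable. The helper choose_device_id, the busy set {0}, and the card count of 8 are assumptions for illustration; the busy ids would be read manually from the npu-smi info output. The commented-out lines show how the id would typically be passed to MindSpore via set_context in recent versions; check this against the version you have installed.

```python
import os


def choose_device_id(busy_ids, device_count=8):
    """Return the lowest device id not in busy_ids, or None if all are busy.

    Hypothetical helper; device_count=8 is an assumed number of cards.
    """
    for device_id in range(device_count):
        if device_id not in busy_ids:
            return device_id
    return None


# Assumption: device 0 is the occupied card, as in the log above.
device_id = choose_device_id(busy_ids={0})
if device_id is None:
    raise RuntimeError("No idle device available")

# Export the chosen id so that the launched program can pick it up.
os.environ["DEVICE_ID"] = str(device_id)
print("using device_id:", device_id)

# In the training or inference script itself (requires MindSpore):
# import mindspore as ms
# ms.set_context(device_target="Ascend", device_id=device_id)
```

Setting the id before the MindSpore runtime initializes matters, because the memory check that raises this error happens during initialization.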