CodeLlama inference on Ascend 910 fails with "get fail deviceLogicId[0]"

1. System Environment

Hardware environment (Ascend/GPU/CPU): Ascend910
MindSpore version: mindspore=2.2.10
Execution mode (PyNative/Graph): any
Python version: Python=3.8.15
Operating system platform: Linux

2. Problem Description

Inference fails and HCCL logs the following errors:

[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:04.666.308 [op_base.cc:56] [2082087] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:04.938.806 [op_base.cc:56] [2082083] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.015.145 [op_base.cc:56] [2082079] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.054.705 [op_base.cc:56] [2082088] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.098.737 [op_base.cc:56] [2082082] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.151.236 [op_base.cc:56] [2082084] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.307.610 [op_base.cc:56] [2082080] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.505.706 [op_base.cc:56] [2082086] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.554.217 [op_base.cc:56] [2082077] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.615.027 [op_base.cc:56] [2082076] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.859.348 [op_base.cc:56] [2082081] [HcclGetDeviceId] get fail deviceLogicId[0]  
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:06.917.314 [op_base.cc:56] [2082078] [HcclGetDeviceId] get fail deviceLogicId[0]

3. Root Cause Analysis

The HCCL initialization interface was not called.

4. Solution

Check the inference configuration script and add a rank_table_file entry to it:

[ascend_context]  
#plugin_custom_ops=KVcache  
# enable_custom_op=All  
provider=ge  
rank_table_file=/data/pangshiguan/mindformer-hk/hccl_8p_01234567_7.213.219.27.json
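Before launching inference, it can help to confirm that the configuration really contains the rank_table_file entry and that the file it points to is readable. The following is a minimal sketch, not part of MindSpore: the helper name check_rank_table and the file name inference.cfg are hypothetical, and it assumes the configuration is INI-style as shown above. It only checks that the rank table exists and parses as JSON; it does not validate the rank table contents.

```python
import configparser
import json
import os
import tempfile


def check_rank_table(config_path):
    """Return a list of problems with the rank_table_file setting (empty if none)."""
    # interpolation=None so that '%' characters in paths are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    with open(config_path, encoding="utf-8") as f:
        parser.read_file(f)

    if not parser.has_section("ascend_context"):
        return ["missing [ascend_context] section"]

    # Lines starting with '#' are treated as comments, so a commented-out
    # rank_table_file entry counts as not set.
    path = parser.get("ascend_context", "rank_table_file", fallback=None)
    if not path:
        return ["rank_table_file is not set in [ascend_context]"]
    if not os.path.isfile(path):
        return [f"rank table file not found: {path}"]

    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except json.JSONDecodeError as exc:
        return [f"rank table is not valid JSON: {exc}"]
    return []


# Demo with throwaway files: first without rank_table_file, then with it.
with tempfile.TemporaryDirectory() as tmp:
    table = os.path.join(tmp, "rank_table.json")
    with open(table, "w", encoding="utf-8") as f:
        json.dump({"placeholder": True}, f)  # stand-in, not a real rank table

    cfg = os.path.join(tmp, "inference.cfg")  # hypothetical file name
    with open(cfg, "w", encoding="utf-8") as f:
        f.write("[ascend_context]\nprovider=ge\n")
    print(check_rank_table(cfg))

    with open(cfg, "a", encoding="utf-8") as f:
        f.write(f"rank_table_file={table}\n")
    print(check_rank_table(cfg))
```

An empty list means the entry is present and the file could be parsed; any other result lists what to fix before rerunning inference.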