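1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend910
MindSpore version: mindspore=2.2.10
Execution mode (PyNative/Graph): any
Python version: Python=3.8.15
Operating system platform: linux
2. Problem description
The following errors are reported: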
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:04.666.308 [op_base.cc:56] [2082087] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:04.938.806 [op_base.cc:56] [2082083] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.015.145 [op_base.cc:56] [2082079] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.054.705 [op_base.cc:56] [2082088] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.098.737 [op_base.cc:56] [2082082] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.151.236 [op_base.cc:56] [2082084] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.307.610 [op_base.cc:56] [2082080] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.505.706 [op_base.cc:56] [2082086] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.554.217 [op_base.cc:56] [2082077] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.615.027 [op_base.cc:56] [2082076] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:05.859.348 [op_base.cc:56] [2082081] [HcclGetDeviceId] get fail deviceLogicId[0]
[ERROR] HCCL(2073294, python3) :2023-12-22-10:36:06.917.314 [op_base.cc:56] [2082078] [HcclGetDeviceId] get fail deviceLogicId[0]
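When the log is long, it can help to confirm that every failure is the same call failing in the same process. The sketch below is only an illustration: the regular expression is written against the line shape shown above, and the name summarize_hccl_errors is hypothetical, not part of HCCL or MindSpore.

```python
import re
from collections import Counter

# Matches HCCL error lines shaped like the ones in the log above.
# The pattern is an assumption based on those lines only.
HCCL_LINE = re.compile(
    r"\[ERROR\] HCCL\((?P<pid>\d+), *(?P<proc>[^)]+)\) *:"
    r"(?P<ts>\S+) \[(?P<src>[^\]]+)\] \[(?P<tid>\d+)\] "
    r"\[(?P<func>\w+)\] (?P<msg>.*)"
)


def summarize_hccl_errors(lines):
    """Count failures per (pid, function, message) and collect thread IDs."""
    counts = Counter()
    threads = {}
    for line in lines:
        match = HCCL_LINE.search(line)
        if match is None:
            continue  # not an HCCL error line of the expected shape
        key = (int(match["pid"]), match["func"], match["msg"].strip())
        counts[key] += 1
        threads.setdefault(key, set()).add(int(match["tid"]))
    return counts, threads
```

Applied to the log above, a single key with many distinct thread IDs indicates that all worker threads of one process fail on the same call, which points to a missing initialization step rather than a per-device fault.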
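3. Root cause analysis
The HCCL initialization interface was not called. The inference configuration did not specify rank_table_file, so every HcclGetDeviceId call failed.
4. Solution
Check the inference configuration script and add rank_table_file to the [ascend_context] section: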
[ascend_context]
#plugin_custom_ops=KVcache
# enable_custom_op=All
provider=ge
rank_table_file=/data/pangshiguan/mindformer-hk/hccl_8p_01234567_7.213.219.27.json
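
Before launching inference, the configuration can be sanity-checked with a small script. This is a minimal sketch using only the Python standard library. The function name check_rank_table is hypothetical. It only verifies that rank_table_file is set, that the file exists, and that it parses as JSON. It does not validate the HCCL rank table schema.

```python
import configparser
import json
from pathlib import Path


def check_rank_table(config_path):
    """Return a list of problems with rank_table_file in [ascend_context]."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it could open; empty means failure.
    if not parser.read(config_path):
        return [f"cannot read config file: {config_path}"]
    if not parser.has_section("ascend_context"):
        return ["missing [ascend_context] section"]

    rank_table = parser.get(
        "ascend_context", "rank_table_file", fallback=""
    ).strip()
    if not rank_table:
        return ["rank_table_file is not set in [ascend_context]"]

    path = Path(rank_table)
    if not path.is_file():
        return [f"rank_table_file does not exist: {rank_table}"]
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"rank_table_file is not valid JSON: {exc}"]
    return []
```

An empty list means the basic checks passed. Any returned message names the first problem found, for example a commented-out or missing rank_table_file entry.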