Collective communication library initialization init() fails with an error asking to launch multi-card training with mpirun

1 System environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: mindspore=2.2.10
Execution mode (PyNative/Graph): PyNative/Graph
Python version: Python=3.8.15
Operating system platform: Linux

2 Error message

Traceback (most recent call last):  
  File "train.py", line 185, in <module>  
    main()  
  File "train.py", line 141, in main  
    set_parameter()  
  File "train.py", line 65, in set_parameter  
    init()  
  File "/home/ma-user/anaconda3/envs/MindSpore/1ib/python3.7/site-packages/mindspore/communication/management.py", line 172. in init  
    init_cluster()  
RuntimeError: Maybe you are trying to call 'mindspore.communication.init()' without using 'mpirun', which will make MindSpore load several environment variables and check their validation. Please use 'mpirun' to launch this process to fix this issue, or refer to this link if you want to run distributed training without using 'mpirun': https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/train_gpu.html

Related source code

if backend_name == "hccl":
    if _is_ps_mode():
        # Use MindSpore cluster to build network for Parameter Server training.
        init_cluster()
        if _is_role_sched() or _is_role_pserver():
            raise RuntimeError("Parameter server and scheduler should use 'CPU' as backend instead of 'Ascend'")
        if _get_ps_context("worker_num") == 1:
            GlobalComm.INITED = True
            _set_elegant_exit_handle()
            return
    if device_target != "Ascend":
        raise RuntimeError("For 'init', the argument 'backend_name' should be 'Ascend' to init hccl, "
                           "but got {}".format(device_target))
    if not host_init:
        _check_parallel_envs()
    GlobalComm.BACKEND = Backend("hccl")
    init_hccl()
    GlobalComm.WORLD_COMM_GROUP = HCCL_WORLD_COMM_GROUP
elif backend_name == "nccl":
    init_cluster()
    GlobalComm.WORLD_COMM_GROUP = NCCL_WORLD_COMM_GROUP
elif backend_name == "mccl":
    init_cluster()
    GlobalComm.WORLD_COMM_GROUP = MCCL_WORLD_COMM_GROUP
else:
    raise RuntimeError("For 'init', the argument 'backend_name' must be nccl while 'device_target' is GPU, "
                       "but got the 'backend_name' : hccl.")

3 Root cause analysis

On this Ascend machine, init() was called to initialize the collective communication library without explicitly specifying which backend to initialize. The call stack in the error log goes through init_cluster(), but reading the source code shows that initializing Ascend's collective communication library HCCL (outside Parameter Server mode) does not call init_cluster(); execution therefore went down the initialization branch of the GPU collective communication library NCCL, which produced the error above.
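
For reference, the failing pattern looks roughly like the sketch below. This is a hedged reconstruction, since train.py and its set_parameter() are not shown here in full; the point is only that init() is called without a backend_name, so MindSpore has to infer the backend and validate the launcher environment on its own.

import mindspore as ms
from mindspore.communication import init

def set_parameter():
    ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
    # No backend specified: when the process is not launched by mpirun (or another
    # supported distributed launcher), this raises the RuntimeError shown above.
    init()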

4 Solution

When calling init() to initialize the collective communication library, specify the backend explicitly, e.g. call init("hccl") on an Ascend machine.
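
A minimal sketch of the fix (assuming MindSpore 2.2 on an Ascend machine; the rest of the training script is omitted):

import mindspore as ms
from mindspore.communication import init, get_rank, get_group_size

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
init("hccl")                   # explicitly initialize the HCCL backend on Ascend
rank_id = get_rank()           # rank of this process in the world communication group
rank_size = get_group_size()   # total number of devices participating in the job
print(f"rank {rank_id}/{rank_size} initialized")

Note that the process still has to be started by a supported distributed launcher (for example mpirun, or the rank-table based startup on Ascend) so that the environment variables checked during HCCL initialization are present.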