MindSpore initialized in GRAPH_MODE reports that the current execution mode has task sinking (TASK_SINK) disabled

1 System Environment

Hardware environment (Ascend/GPU/CPU): Ascend/GPU/CPU
MindSpore version: mindspore=2.1.0
Execution mode (PyNative/Graph): Graph
Python version: Python=3.9.7
Operating system platform: any

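2 Error Information

2.1 Problem Description

MindSpore documentation: https://mindspore.cn/tutorials/experts/zh-CN/r2.1/parallel/memory_offload.html
Following the MindSpore documentation above in an environment with MindSpore 2.1 and CANN 6.3, the training code runs normally.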

import mindspore

# Offload settings following the memory_offload tutorial linked above
offload_config = {"offload_param": "cpu", "auto_offload": False,
                  "offload_cpu_size": "512GB", "offload_disk_size": "1024GB",
                  "offload_path": "./offload/", "host_mem_block_size": "1GB",
                  "enable_aio": True, "enable_pinned_mem": True}
mindspore.set_context(mode=mindspore.GRAPH_MODE, memory_offload="ON", max_device_memory="30GB")
mindspore.set_offload_context(offload_config=offload_config)

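Once the heterogeneous storage management (memory offload) settings are added to the code, it fails with "RuntimeError: HCCL is initialized in GRAPH_MODE but current execution mode is GRAPH_MODE disable TASK_SINK." In other words, HCCL (the Huawei Collective Communication Library) was initialized in graph mode (GRAPH_MODE), but the current execution mode is graph mode with task sinking (TASK_SINK) disabled.

2.2 Error Message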

RuntimeError: HCCL is initialized in GRAPH_MODE but current execution mode is GRAPH_MODE disable TASK_SINK.

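3 Root Cause Analysis

When heterogeneous training (memory offload) is enabled, the launch script must start the training processes through MPI (mpirun). Launching the job any other way leads to the error above.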
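
Since the root cause is the launch method, a training script can check for it before any communication setup. The following is a minimal sketch, not part of MindSpore: the helper names launched_by_open_mpi and require_mpi_launch are hypothetical, and it assumes Open MPI, whose mpirun exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE to every process it starts. Other MPI implementations expose different variables.

```python
import os


def launched_by_open_mpi(environ=None):
    """Return True if Open MPI's per-process variables are present.

    Assumption: the job is started with Open MPI's mpirun, which sets
    OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE for each process.
    """
    env = os.environ if environ is None else environ
    return "OMPI_COMM_WORLD_RANK" in env and "OMPI_COMM_WORLD_SIZE" in env


def require_mpi_launch(memory_offload, environ=None):
    """Fail early when offload is requested outside an mpirun launch.

    Hypothetical guard: memory_offload is the "ON"/"OFF" string that is
    later passed to mindspore.set_context.
    """
    if memory_offload.upper() == "ON" and not launched_by_open_mpi(environ):
        raise RuntimeError(
            "memory_offload is ON but the process was not started with mpirun"
        )


# Example with explicit environments instead of the real os.environ.
mpi_env = {"OMPI_COMM_WORLD_RANK": "0", "OMPI_COMM_WORLD_SIZE": "8"}
print(launched_by_open_mpi(mpi_env))  # environment as seen under mpirun
print(launched_by_open_mpi({}))       # environment of a plain python launch
```

Calling require_mpi_launch at the top of the training script would turn the HCCL error into a message that names the launch method directly.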

4 Solution

Modify the launch script as follows so that training is started with mpirun:

#!/bin/bash
# Applicable to Ascend or GPU.

echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run.sh BATCH_SIZE MEMORY_OFFLOAD"
echo "For example: bash run.sh 96 ON"
echo "=============================================================================================================="
set -e
# Stop early if the two required arguments are missing.
if [ "$#" -ne 2 ]; then
    exit 1
fi
EXEC_PATH=$(pwd)
BATCH_SIZE=$1        # batch size, e.g. 96
MEMORY_OFFLOAD=$2    # ON or OFF
# Offload settings forwarded to train.py
OFFLOAD_PARAM="cpu"
AUTO_OFFLOAD=true
OFFLOAD_CPU_SIZE="512GB"
OFFLOAD_DISK_SIZE="1024GB"
  
# DATA_PATH below points at cifar-10-batches-bin, so check for that directory
# before downloading and unpacking the archive again.
if [ ! -d "${EXEC_PATH}/cifar-10-batches-bin" ]; then
    if [ ! -f "${EXEC_PATH}/cifar-10-binary.tar.gz" ]; then
        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz
    fi
    tar -zxvf cifar-10-binary.tar.gz
fi
export DATA_PATH=${EXEC_PATH}/cifar-10-batches-bin
  
# Start 8 processes through MPI. Nothing, not even a space, may follow a trailing backslash.
mpirun -n 8 --output-filename log_output --merge-stderr-to-stdout python train.py \
  --batch_size="$BATCH_SIZE" --memory_offload="$MEMORY_OFFLOAD" \
  --offload_param="$OFFLOAD_PARAM" --auto_offload="$AUTO_OFFLOAD" \
  --offload_cpu_size="$OFFLOAD_CPU_SIZE" --offload_disk_size="$OFFLOAD_DISK_SIZE" \
  --host_mem_block_size="1GB" --enable_pinned_mem=true --enable_aio=true
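
The launch script passes the offload settings to train.py as command-line flags, but train.py itself is not shown here. The following is a minimal sketch of how such a script might parse those flags and rebuild the offload_config dictionary from section 2.1. The helpers str2bool, build_parser and build_offload_config are hypothetical, the defaults are illustrative, and offload_path is hard-coded to the value used earlier because the launch script does not pass it.

```python
import argparse


def str2bool(value):
    """Convert shell-style flags such as 'true' or 'false' to booleans."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes", "on"):
        return True
    if text in ("false", "0", "no", "off"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Declare the flags that the launch script passes to train.py."""
    parser = argparse.ArgumentParser(description="hypothetical train.py options")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--memory_offload", choices=["ON", "OFF"], default="OFF")
    parser.add_argument("--offload_param", default="cpu")
    parser.add_argument("--auto_offload", type=str2bool, default=True)
    parser.add_argument("--offload_cpu_size", default="512GB")
    parser.add_argument("--offload_disk_size", default="1024GB")
    parser.add_argument("--host_mem_block_size", default="1GB")
    parser.add_argument("--enable_pinned_mem", type=str2bool, default=True)
    parser.add_argument("--enable_aio", type=str2bool, default=True)
    return parser


def build_offload_config(args):
    """Assemble the dictionary expected by mindspore.set_offload_context."""
    return {
        "offload_param": args.offload_param,
        "auto_offload": args.auto_offload,
        "offload_cpu_size": args.offload_cpu_size,
        "offload_disk_size": args.offload_disk_size,
        "offload_path": "./offload/",  # assumption: same path as in section 2.1
        "host_mem_block_size": args.host_mem_block_size,
        "enable_aio": args.enable_aio,
        "enable_pinned_mem": args.enable_pinned_mem,
    }


# Example with an explicit argument list, mirroring "bash run.sh 96 ON".
example_args = build_parser().parse_args(
    ["--batch_size=96", "--memory_offload=ON", "--auto_offload=true"]
)
example_config = build_offload_config(example_args)
print(example_config)

# In a real train.py the result would then feed the calls from section 2.1:
#   mindspore.set_context(mode=mindspore.GRAPH_MODE,
#                         memory_offload=example_args.memory_offload,
#                         max_device_memory="30GB")
#   mindspore.set_offload_context(offload_config=example_config)
```

Parsing the booleans explicitly matters because argparse would otherwise keep "true" and "false" as non-empty strings, both of which are truthy in Python.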