1 System environment
Hardware environment (Ascend/GPU/CPU): Ascend/GPU/CPU
MindSpore version: mindspore=2.1.0
Execution mode (PyNative/Graph): Graph
Python version: Python=3.9.7
Operating system: any
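2 Error report
2.1 Problem description
MindSpore documentation: https://mindspore.cn/tutorials/experts/zh-CN/r2.1/parallel/memory_offload.html
Following the MindSpore documentation above, with MindSpore 2.1 and CANN 6.3, the training code runs normally as long as memory offload is not enabled.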
import mindspore
offload_config = {
    "offload_param": "cpu",
    "auto_offload": False,
    "offload_cpu_size": "512GB",
    "offload_disk_size": "1024GB",
    "offload_path": "./offload/",
    "host_mem_block_size": "1GB",
    "enable_aio": True,
    "enable_pinned_mem": True,
}
mindspore.set_context(mode=mindspore.GRAPH_MODE,memory_offload="ON",max_device_memory="30GB")
mindspore.set_offload_context(offload_config=offload_config)
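Once the heterogeneous storage (memory offload) settings above are added to the code, training fails with: RuntimeError: HCCL is initialized in GRAPH_MODE but current execution mode is GRAPH_MODE disable TASK_SINK. In other words, HCCL (Huawei Collective Communication Library) was initialized in graph mode, but the current execution mode is graph mode with task sinking (TASK_SINK) disabled.
2.2 Error message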
RuntimeError: HCCL is initialized in GRAPH_MODE but current execution mode is GRAPH_MODE disable TASK_SINK.
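3 Root cause analysis
When heterogeneous training (memory offload) is enabled, the training script must be launched through MPI (mpirun).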
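One way to confirm how a process was started is to look for the environment variables that the launcher sets for each rank. The sketch below assumes Open MPI's mpirun, which exports OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK; other MPI implementations use different variables. The helper name launched_by_mpirun is hypothetical and not part of MindSpore.

```python
import os


def launched_by_mpirun(env=None):
    """Return True if the Open MPI rank variables are present.

    Assumption: the launcher is Open MPI's mpirun, which sets
    OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK for every process.
    """
    env = os.environ if env is None else env
    return "OMPI_COMM_WORLD_SIZE" in env and "OMPI_COMM_WORLD_RANK" in env


if __name__ == "__main__":
    if launched_by_mpirun():
        rank = os.environ["OMPI_COMM_WORLD_RANK"]
        size = os.environ["OMPI_COMM_WORLD_SIZE"]
        print(f"started by mpirun: rank {rank} of {size}")
    else:
        print("not started by mpirun; launch with mpirun before enabling memory offload")
```

Calling such a check at the top of train.py could surface a wrong launch method before the HCCL error appears.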
4 Solution
Modify the launch script as follows:
#!/bin/bash
# applicable to Ascend or GPU
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run.sh BATCH_SIZE MEMORY_OFFLOAD"
echo "For example: bash run.sh 96 ON"
echo "=============================================================================================================="
set -e
EXEC_PATH=$(pwd)
BATCH_SIZE=$1
MEMORY_OFFLOAD=$2
OFFLOAD_PARAM="cpu"
AUTO_OFFLOAD=true
OFFLOAD_CPU_SIZE="512GB"
OFFLOAD_DISK_SIZE="1024GB"
# Download and unpack CIFAR-10 only if the extracted directory is missing.
if [ ! -d "${EXEC_PATH}/cifar-10-batches-bin" ]; then
  if [ ! -f "${EXEC_PATH}/cifar-10-binary.tar.gz" ]; then
    wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/cifar-10-binary.tar.gz
  fi
  tar -zxvf cifar-10-binary.tar.gz
fi
export DATA_PATH="${EXEC_PATH}/cifar-10-batches-bin"
# Launch training on 8 processes through mpirun, as required when memory offload is enabled.
mpirun -n 8 --output-filename log_output --merge-stderr-to-stdout python train.py \
  --batch_size="$BATCH_SIZE" --memory_offload="$MEMORY_OFFLOAD" \
  --offload_param="$OFFLOAD_PARAM" --auto_offload="$AUTO_OFFLOAD" \
  --offload_cpu_size="$OFFLOAD_CPU_SIZE" --offload_disk_size="$OFFLOAD_DISK_SIZE" \
  --host_mem_block_size="1GB" --enable_pinned_mem=true --enable_aio=true
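The script above passes its settings to train.py as command-line flags, but train.py itself is not shown in this post. The following is a minimal sketch of how those flags might be turned into the offload_config dictionary from section 2.1. The flag names match the mpirun command; the function names str2bool and build_offload_config, the default values, and the --offload_path flag (taken from the section 2.1 snippet, not from the script) are assumptions. The MindSpore calls are left as comments so the sketch runs without MindSpore installed.

```python
import argparse


def str2bool(value):
    """Interpret shell-style flags such as 'true' or 'false' as booleans."""
    return str(value).lower() in ("true", "1", "yes", "on")


def build_offload_config(argv=None):
    """Parse the flags used by the launch script and build offload_config."""
    parser = argparse.ArgumentParser(description="memory offload options (sketch)")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--memory_offload", default="OFF")
    parser.add_argument("--offload_param", default="cpu")
    parser.add_argument("--auto_offload", type=str2bool, default=True)
    parser.add_argument("--offload_cpu_size", default="512GB")
    parser.add_argument("--offload_disk_size", default="1024GB")
    # Assumption: not passed by the launch script; default copied from section 2.1.
    parser.add_argument("--offload_path", default="./offload/")
    parser.add_argument("--host_mem_block_size", default="1GB")
    parser.add_argument("--enable_pinned_mem", type=str2bool, default=True)
    parser.add_argument("--enable_aio", type=str2bool, default=True)
    args = parser.parse_args(argv)

    config = {
        "offload_param": args.offload_param,
        "auto_offload": args.auto_offload,
        "offload_cpu_size": args.offload_cpu_size,
        "offload_disk_size": args.offload_disk_size,
        "offload_path": args.offload_path,
        "host_mem_block_size": args.host_mem_block_size,
        "enable_aio": args.enable_aio,
        "enable_pinned_mem": args.enable_pinned_mem,
    }
    return args, config


if __name__ == "__main__":
    args, offload_config = build_offload_config()
    # In train.py these values would then be handed to MindSpore, as in section 2.1:
    #   mindspore.set_context(mode=mindspore.GRAPH_MODE, memory_offload=args.memory_offload)
    #   mindspore.set_offload_context(offload_config=offload_config)
    print(args.batch_size, args.memory_offload, offload_config)
```

Parsing the boolean flags explicitly matters here because the script passes them as the strings "true" and "false", and a non-empty string such as "false" would otherwise be treated as truthy.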