1. Symptom
Running the script inside the Docker container fails with:
RuntimeError: Maybe you are trying to call 'mindspore.communication.init()' without using 'mpirun', which will make MindSpore load several environment variables and check their validation. Please use 'mpirun' to launch this process to fix this issue, or refer to this link if you want to run distributed training without using 'mpirun': https://www.mindspore.cn/tutorials/experts/zh-CN/master/parallel/train_gpu.html
· Running npu-smi info outside the container shows that the cards are not occupied.
· Running npu-smi info inside the container fails:
DrvMngGetConsoleLogLevel failed. (g_conLogLevel=3)
dcmi model initialized failed, because the device is used. ret is -8020
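The in-container check above can be scripted. The following is a minimal sketch, assuming the failure text matches the message shown above; the function name reports_device_in_use is hypothetical and is not part of npu-smi.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads npu-smi output on stdin and exits 0 when it
# contains the "device is used" failure quoted above, non-zero otherwise.
reports_device_in_use() {
  grep -q 'device is used'
}

# Usage inside the container (stderr is merged in case the error goes there):
#   npu-smi info 2>&1 | reports_device_in_use && echo "card is held elsewhere"
```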
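2. Root cause
The cards are already held by another Docker container, so the current container cannot load them and the script fails. A container started with the flags below takes exclusive ownership of the cards: each card can be loaded by only one container, so other containers cannot see it.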
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
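To find out which running container was started with a given card, one option is to list the host devices recorded for each container. The sketch below assumes the Docker CLI is available on the host and that GNU xargs is installed (for the -r flag); the function names list_container_devices and holders_of are hypothetical.

```shell
#!/usr/bin/env bash
# Print one line per running container: "<name> <host-device> <host-device> ...".
# Only devices passed with --device appear in HostConfig.Devices.
list_container_devices() {
  docker ps -q | xargs -r docker inspect \
    --format '{{.Name}}{{range .HostConfig.Devices}} {{.PathOnHost}}{{end}}'
}

# Read lines in the format above on stdin and print the names of the
# containers that list the device path given as the first argument.
holders_of() {
  local dev="$1"
  awk -v dev="$dev" '{
    for (i = 2; i <= NF; i++) {
      if ($i == dev) { print $1; break }
    }
  }'
}

# Usage on the host:
#   list_container_devices | holders_of /dev/davinci0
```

Any container printed for a card is a candidate for the one holding it exclusively; stopping that container, or restarting it without the --device flags, should release the card.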
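3. Solution
Start the container as shown below instead. Cards loaded this way can be shared by multiple Docker containers, and each container can see the card status.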
docker run -itd --net=host --cap-add=SYS_PTRACE -e ASCEND_VISIBLE_DEVICES=0-7