1 System environment
Hardware environment (Ascend/GPU/CPU): Ascend 910
MindSpore version: mindspore=2.2.0
Execution mode (PyNative/Graph): PyNative/Graph
Python version: Python=3.8.15
Operating system platform: Linux
2 Error message
The training script fails as soon as it starts, with: RuntimeError: Initialize GE failed!
File "./train_test.py", line 136, in run
mtl_fwk = MultiTaskFramework(config=config)
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/framework.py", line 122, in __init__
self._build_customized(self.cfg.task)
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/framework.py", line 352, in _build_customized
train_data_utils = build_dataloader(task_cfg.dataset.train.name, task_cfg.dataset.train, True) \
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/subnet_context.py", line 423, in build_dataloader
build_loader_sampler(dataset_builders.dataset.get_builder(name, dataset_fields)(cfg), is_training=training))
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/subnet_context.py", line 221, in build_loader_sampler
bird_view_feature_batch = BirdViewFeatureBatchap(dataset.config)
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/subnet_context.py", line 137, in __init__
self.lidar_voxel_layer = BirdViewFeatureBatch(cfg)
File "/train-worker1-log/bev/tasks/bev_task/uvp_module/models/lidar_backbone/bird_view_feature_batch.py", line 29, in __init__
self.op_init = BirdViewInit(cfg)
File "/train-worker1-log/bev/tasks/bev_task/uvp_module/models/lidar_backbone/lidar_det_ext.py", line 10, in __init__
super(BirdViewInit, self).__init__()
File "/home/ma-user/miniconda3/envs/ms: .@MixPre/lib/python3.8/site-packages/mindspore/nn/cell.py", line 135, in __init__
init_pipeline()
RuntimeError: Initialize GE failed!
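3 Solution
This is a typical GE initialization failure. Suspecting insufficient disk space, we ran df -h and found the overlay filesystem at 100% usage. An earlier 8-card run had crashed with a core dump, and the dump files were written to /tmp/core-file. They filled the filesystem, so GE could not initialize. Deleting the core files resolved the problem.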
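
The same check can be scripted and run before training starts. The following is only a minimal sketch using the Python standard library, not part of MindSpore or the original fix. The helper names usage_percent and largest_files are hypothetical, and the mount points in the example are assumptions to adjust for your container. The percentage is used/total, so it can differ slightly from the Use% column of df -h, which excludes reserved blocks.

```python
import os
import shutil


def usage_percent(path):
    """Return the used share (0-100) of the filesystem that holds `path`."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total if total else 0.0


def largest_files(directory, top=10):
    """Return up to `top` (size_in_bytes, path) pairs under `directory`, largest first."""
    found = []
    # os.walk yields nothing if `directory` does not exist, so the result is then empty.
    for root, _dirs, names in os.walk(directory):
        for name in names:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue  # skip symlinks so a link target is not counted here
            try:
                found.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    found.sort(reverse=True)
    return found[:top]


if __name__ == "__main__":
    # Assumed mount points; "/" is usually the overlay filesystem inside a container.
    for mount in ("/", "/tmp"):
        if os.path.exists(mount):
            print(f"{mount}: {usage_percent(mount):.1f}% used")
    # Directory named in the post; core dumps may live elsewhere on other systems.
    for size, path in largest_files("/tmp/core-file"):
        print(f"{size / 2**30:8.2f} GiB  {path}")
```

If the first loop reports a filesystem close to 100% and the second lists large core files, removing them (after confirming they are no longer needed for debugging) should free the space, as it did in this case.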