Ascend 910 training script fails at startup with RuntimeError: Initialize GE failed!

1 System environment

Hardware environment (Ascend/GPU/CPU): Ascend 910
MindSpore version: mindspore=2.2.0
Execution mode (PyNative/Graph): PyNative/Graph
Python version: Python=3.8.15
Operating system: Linux

2 Error message

The training script fails as soon as it starts, with RuntimeError: Initialize GE failed!

File "./train_test.py", line 136, in run  
    mtl_fwk = MultiTaskFramework(config=config)  
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/framework.py", line 122, in __init__  
    self._build_customized(self.cfg.task)  
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/framework.py", line 352, in _build_customized  
    train_data_utils = build_dataloader(task_cfg.dataset.train.name, task_cfg.dataset.train, True)  
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/subnet_context.py", line 423, in build_dataloader  
    build_loader_sampler(dataset_builders.dataset.get_builder(name, dataset_fields)(cfg), is_training=training))  
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/subnet_context.py", line 221, in build_loader_sampler  
    bird_view_feature_batch = BirdViewFeatureBatchap(dataset.config)  
File "/train-worker1-log/bev/uvpsandbox-api/uvpsandbox/uvpsandbox/subnet_context.py", line 137, in __init__  
    self.lidar_voxel_layer = BirdViewFeatureBatch(cfg)  
File "/train-worker1-log/bev/tasks/bev_task/uvp_module/models/lidar_backbone/bird_view_feature_batch.py", line 29, in __init__  
    self.op_init = BirdViewInit(cfg)  
File "/train-worker1-log/bev/tasks/bev_task/uvp_module/models/lidar_backbone/lidar_det_ext.py", line 10, in __init__  
    super(BirdViewInit, self).__init__()  
File "/home/ma-user/miniconda3/envs/ms: .@MixPre/lib/python3.8/site-packages/mindspore/nn/cell.py", line 135, in __init__  
    init_pipeline()  
RuntimeError: Initialize GE failed!

3 Solution

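This is a typical GE initialization failure. Suspecting insufficient disk space, we ran df -h and found that the overlay filesystem was 100% used. An earlier 8-card run had crashed with a core dump, and the dump files were saved under /tmp/core-file, which filled up that filesystem and caused GE initialization to fail. Deleting the core files resolved the problem.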
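
The check described above can be scripted so that it runs before a training job starts. The following is a minimal sketch using only the Python standard library; the helper names `usage_percent` and `largest_files` and the 95% threshold are illustrative assumptions, not part of MindSpore or CANN. Note that `shutil.disk_usage` computes the ratio as used/total, so the number can differ slightly from the Use% column printed by df -h.

```python
import shutil
from pathlib import Path


def usage_percent(path):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(directory, top_n=10):
    """Return up to `top_n` (size_bytes, path) pairs under `directory`, largest first."""
    root = Path(directory)
    if not root.is_dir():
        return []
    sized = []
    for p in root.rglob("*"):
        try:
            if p.is_file() and not p.is_symlink():
                sized.append((p.stat().st_size, p))
        except OSError:
            # The file vanished or is unreadable; skip it.
            continue
    sized.sort(key=lambda item: item[0], reverse=True)
    return sized[:top_n]


if __name__ == "__main__":
    THRESHOLD = 95.0  # assumed warning threshold, adjust as needed
    for mount in ("/", "/tmp"):
        if not Path(mount).exists():
            continue
        pct = usage_percent(mount)
        flag = "  <-- nearly full" if pct >= THRESHOLD else ""
        print(f"{mount}: {pct:.1f}% used{flag}")
    # /tmp/core-file is the dump directory reported in this post.
    for size, path in largest_files("/tmp/core-file"):
        print(f"{size / 2**20:10.1f} MiB  {path}")
```

The script only reports; it deliberately does not delete anything, so the listed core files can be reviewed and removed manually once they are no longer needed for debugging.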