The following changes were made starting from pretrain_qwen3_32b_4k.yaml (the diff below compares it against the modified copy, pretrain_qwen3_0_6b_4k.yaml):
(py39) [mindspore@cloud-9ba17f41-df67-4c06-8d88-680308453fb4-7649d677b6-9ps9p mindformers]$ diff configs/qwen3/pretrain_qwen3_32b_4k.yaml configs/qwen3/pretrain_qwen3_0_6b_4k.yaml
9,11c9,11
< use_parallel: True
< run_mode: 'train'
< use_legacy: False
---
> use_parallel: False
> run_mode: 'finetune'
> use_legacy: True
20c20
< epochs: 2
---
> epochs: 1
44c44
< - 8000 # Number of samples in the training set
---
> - 400 # Number of samples in the training set
61c61
< - "/path/to/wiki103-megatron_text_document"
---
> - "/home/mindspore/work/demo/megatron_data/wikitext-2-v1-qwen3_text_document_text_document"
64c64
< num_parallel_workers: 8
---
> num_parallel_workers: 1
89,92c89,92
< model_parallel: 4 # Number of model parallel
< pipeline_stage: 4 # Number of pipeline parallel
< micro_batch_num: 4 # Pipeline parallel microbatch size
< vocab_emb_dp: True # Whether to split the vocabulary in the data parallel dimension
---
> model_parallel: 1 # Number of model parallel
> pipeline_stage: 1 # Number of pipeline parallel
> micro_batch_num: 1 # Pipeline parallel microbatch size
> vocab_emb_dp: False # Whether to split the vocabulary in the data parallel dimension
102c102
< full_batch: False # Whether to load the full batch of data in parallel mode
---
> full_batch: True # Whether to load the full batch of data in parallel mode
120c120
< parallel_optimizer_comm_recompute: True
---
> parallel_optimizer_comm_recompute: False
125a126
> type: "Qwen3ForCausalLM"
126a128
> type: "Qwen3Config"
129,132c131,134
< hidden_size: 5120
< intermediate_size: 25600
< num_hidden_layers: 64
< num_attention_heads: 64
---
> hidden_size: 1024
> intermediate_size: 3072
> num_hidden_layers: 28
> num_attention_heads: 16
158c160
< offset: [-1, -1, 1, 1]
---
> offset: 0
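
For reference, the single-device settings above need to stay mutually consistent: in MindFormers the number of launched workers generally has to equal data_parallel * model_parallel * pipeline_stage. A minimal sketch of the parallel section implied by the diff (data_parallel does not appear in the hunks above and is assumed to be 1):

# Sketch only, not the verified file contents.
# Assumption: data_parallel stays at 1, so that
# data_parallel * model_parallel * pipeline_stage == worker_num == 1.
parallel_config:
  data_parallel: 1
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: 1
  vocab_emb_dp: False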
The job was launched with:

msrun --bind_core=True --worker_num=1 --local_worker_num=1 --master_port=7118 --log_dir=output/msrun_log --join=True --cluster_time_out=300 run_mindformer.py --config /home/mindspore/work/demo/mindformers/configs/qwen3/pretrain_qwen3_0_6b_4k.yaml

and failed with the following error:
Traceback (most recent call last):
  File "/home/mindspore/miniconda3/envs/py39/bin/msrun", line 7, in <module>
    sys.exit(main())
  File "/home/mindspore/miniconda3/envs/py39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 191, in main
    run(args)
  File "/home/mindspore/miniconda3/envs/py39/lib/python3.9/site-packages/mindspore/parallel/cluster/run.py", line 185, in run
    process_manager.run()
  File "/home/mindspore/miniconda3/envs/py39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 268, in run
    self.join_processes()
  File "/home/mindspore/miniconda3/envs/py39/lib/python3.9/site-packages/mindspore/parallel/cluster/process_entity/_api.py", line 387, in join_processes
    raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: output/msrun_log.
The log contents are:
(py39) [mindspore@cloud-9ba17f41-df67-4c06-8d88-680308453fb4-7649d677b6-9ps9p mindformers]$ tail -f output/msrun_log/worker_0.log
    network = build_model(config, default_args=default_args)
  File "/home/mindspore/work/demo/mindformers/mindformers/models/build_model.py", line 63, in build_model
    model_config = build_model_config(config.model_config, default_args=default_args)
  File "/home/mindspore/work/demo/mindformers/mindformers/models/build_config.py", line 62, in build_model_config
    return MindFormerRegister.get_instance_from_cfg(
  File "/home/mindspore/work/demo/mindformers/mindformers/tools/register/register.py", line 389, in get_instance_from_cfg
    obj_cls = cls.get_cls(module_type, obj_type)
  File "/home/mindspore/work/demo/mindformers/mindformers/tools/register/register.py", line 287, in get_cls
    raise ValueError(f"Can't find class type {module_type} class name {class_name} in class registry "
ValueError: Can't find class type config class name Qwen3Config in class registry when use_legacy=True
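
The last line points at the root cause: with use_legacy: True, MindFormers resolves model and config classes through the legacy class registry, and per the error text Qwen3Config is not registered there. A minimal sketch of the fields to revert, assuming (as the message implies) the Qwen3 classes are only reachable with the new code path:

# Sketch only: flip the registry switch back so Qwen3Config can be resolved.
use_legacy: False  # True routes lookups to the legacy registry, which lacks Qwen3Config
model:
  arch:
    type: "Qwen3ForCausalLM"   # layout assumed from the 125a126 hunk above
  model_config:
    type: "Qwen3Config"        # layout assumed from the 126a128 hunk above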