[Case] [MindSpore] [Offline Weight Conversion Series, Part 3] Converting between complete and distributed weights in MindSpore ckpt format

Scenario 4: Converting a complete ckpt weight to distributed weights

Converting via the safetensors format is still the recommended route.

Method 1: ckpt_to_safetensors + transform_checkpoints


This method uses safetensors as the intermediate format for the weight conversion.

import mindspore as ms

# Interface with its default values; the arguments are placeholders described below
ms.transform_checkpoints(src_checkpoints_dir, dst_checkpoints_dir, ckpt_prefix, src_strategy_file=None, dst_strategy_file=None, process_num=1, output_format='ckpt')
print("Transform ckpt DONE", flush=True)
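  • src_checkpoints_dir: path of weight A. For distributed weights, point to the directory one level above the rank directories; for a complete weight, point to the file itself
  • dst_checkpoints_dir: path where weight B is generated
  • ckpt_prefix: file name prefix of the generated weight B
  • src_strategy_file: merged strategy file of weight A; pass None for a complete weight
  • dst_strategy_file: merged strategy file of weight B; pass None for a complete weight
  • process_num: number of parallel processes; choose it according to the CPU and memory usage of your machine
  • output_format: "ckpt" or "safetensors"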
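The parameter rules above can be sketched as a small helper that collects the arguments before the call. This is only an illustration: build_transform_kwargs and run_transform are hypothetical names, not part of MindSpore, and the checks simply restate the bullet list (a complete weight gets None as its strategy file, output_format is "ckpt" or "safetensors").

```python
from typing import Optional

VALID_FORMATS = ("ckpt", "safetensors")


def build_transform_kwargs(src: str, dst: str, prefix: str,
                           src_strategy: Optional[str] = None,
                           dst_strategy: Optional[str] = None,
                           process_num: int = 1,
                           output_format: str = "ckpt") -> dict:
    """Collect the arguments of transform_checkpoints in one dict (hypothetical helper)."""
    if output_format not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {VALID_FORMATS}")
    if process_num < 1:
        raise ValueError("process_num must be at least 1")
    return {
        "src_checkpoints_dir": src,
        "dst_checkpoints_dir": dst,
        "ckpt_prefix": prefix,
        "src_strategy_file": src_strategy,   # None when weight A is a complete weight
        "dst_strategy_file": dst_strategy,   # None when weight B is a complete weight
        "process_num": process_num,
        "output_format": output_format,
    }


def run_transform(kwargs: dict) -> None:
    # MindSpore is imported lazily so the helper above can be used without it.
    import mindspore as ms
    ms.transform_checkpoints(**kwargs)
```

For Scenario 4 the source is a complete weight, so only dst_strategy is set; for the reverse direction only src_strategy is set.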

Example:

  1. Merge the destination strategy files
  2. Convert the ckpt to safetensors
import mindspore as ms

ms.ckpt_to_safetensors(
    file_path="/xxx/ckpt_convert/weight_o/net.ckpt",
    save_path="/xxx/ckpt_convert/weight_o/ckpt_to_sf_all/",
    processes_num=8)

Input: the complete ckpt file given by file_path.


Output: the converted safetensors file under save_path.

3. Call transform_checkpoints to convert the safetensors weight into a distributed ckpt

import mindspore as ms
ms.transform_checkpoints(
    "/xxx/ckpt_convert/weight_o/ckpt_to_sf_all/rank_0/net.safetensors",
    "/xxx/ckpt_convert/weight_o/distributed_ckpt",
    "dst_ckpt_prefix",
    src_strategy_file=None,
    dst_strategy_file="/xxx/ckpt_convert/142merge.ckpt",
    process_num=4,
    output_format='ckpt')
print("Transform ckpt DONE", flush=True)
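Step 1 of the example (merging the destination strategy files) has no code in this post. A minimal sketch follows, assuming the strategy files were saved per pipeline stage into one directory and that ms.merge_pipeline_strategys(src_strategy_dirs, dst_strategy_file) is the right call for your MindSpore version. merge_strategy and the directory name strategy_dir are hypothetical; merge_fn only exists so the wrapper can be tried without MindSpore.

```python
from typing import Callable, Optional


def merge_strategy(src_strategy_dir: str, dst_strategy_file: str,
                   merge_fn: Optional[Callable[[str, str], None]] = None) -> str:
    """Merge the strategy files in src_strategy_dir into dst_strategy_file."""
    if merge_fn is None:
        # Lazy import: only needed when no replacement function is supplied.
        import mindspore as ms
        merge_fn = ms.merge_pipeline_strategys
    merge_fn(src_strategy_dir, dst_strategy_file)
    return dst_strategy_file


# Usage (paths are placeholders; strategy_dir is a hypothetical directory name):
# merge_strategy("/xxx/ckpt_convert/strategy_dir", "/xxx/ckpt_convert/142merge.ckpt")
```

The returned path is what step 3 passes as dst_strategy_file.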


Method 2: ckpt_to_safetensors + load_distributed_checkpoint

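import mindspore as ms
ms.load_distributed_checkpoint(
    network=None,  # may also be a Net instance instead of None
    predict_strategy="<merged strategy file of weight B>",  # None for a complete weight
    format='safetensors',  # input format: 'ckpt' (default) or 'safetensors'
    output_format='ckpt',  # 'safetensors' (default) or 'ckpt'; available since 2.4.0
    unified_safetensors_dir="<directory produced by the offline merge in the previous step>",
    dst_safetensors_dir="<output directory for weight B>")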

Example:

  1. Merge the destination strategy files
  2. Convert the ckpt to safetensors
import mindspore as ms

ms.ckpt_to_safetensors(
    file_path="/xxx/ckpt_convert/weight_o/net.ckpt",
    save_path="/xxx/ckpt_convert/weight_o/ckpt_to_sf_all/",
    processes_num=8)


3. Call load_distributed_checkpoint to convert the safetensors weight into a distributed ckpt

import mindspore as ms

ms.load_distributed_checkpoint(
    network=None,
    checkpoint_filenames=None,
    predict_strategy="/xxx/ckpt_convert/142merge.ckpt",
    format='safetensors',
    output_format='ckpt',  # the default is 'safetensors'; set 'ckpt' to get ckpt output
    unified_safetensors_dir="/xxx/ckpt_convert/weight_o/ckpt_to_sf_all/rank_0/net.safetensors",
    dst_safetensors_dir="/xxx/ckpt_convert/weight_o/142distributed_ckpt_1")
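Steps 2 and 3 of Method 2 can be chained in one function. The sketch below is a hypothetical wrapper (full_ckpt_to_distributed is not a MindSpore API); it uses only the keyword arguments that appear in this post. The to_sf and load_dist parameters are an assumption added so the call order can be checked without MindSpore installed; leave them as None to use the real functions.

```python
from typing import Callable, Optional


def full_ckpt_to_distributed(ckpt_file: str, sf_dir: str, unified_path: str,
                             dst_dir: str, dst_strategy: str,
                             processes_num: int = 8,
                             to_sf: Optional[Callable] = None,
                             load_dist: Optional[Callable] = None) -> None:
    """Convert a complete ckpt to safetensors, then split it by the target strategy."""
    if to_sf is None or load_dist is None:
        # Lazy import: only needed when the real MindSpore functions are used.
        import mindspore as ms
        to_sf = to_sf or ms.ckpt_to_safetensors
        load_dist = load_dist or ms.load_distributed_checkpoint

    # Step 2: complete ckpt -> safetensors
    to_sf(file_path=ckpt_file, save_path=sf_dir, processes_num=processes_num)

    # Step 3: safetensors -> distributed ckpt, sliced by the merged strategy file
    load_dist(network=None,
              predict_strategy=dst_strategy,
              format="safetensors",
              output_format="ckpt",
              unified_safetensors_dir=unified_path,
              dst_safetensors_dir=dst_dir)
```

unified_path is passed through unchanged, so it can be whatever the example above uses for unified_safetensors_dir.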

Method 3: transform_checkpoints directly (not recommended)


transform_checkpoints detects the input format automatically.

import mindspore as ms

ms.transform_checkpoints(
    "/xxx/ckpt_convert/weight_o/ckpt_to_sf_all/rank_0/net.ckpt",
    "/xxx/ckpt_convert/weight_o/distributed_ckpt",
    "dst_ckpt_prefix",
    src_strategy_file=None,
    dst_strategy_file="/xxx/ckpt_convert/142merge.ckpt",
    process_num=4,
    output_format='ckpt')
print("Transform ckpt DONE", flush=True)
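Whichever method is used, it helps to check the generated directory before loading it. The sketch below assumes the output is laid out as one rank_<id> subdirectory per rank with .ckpt files inside; that layout and the file names in the demo are assumptions, so verify them against your own output. list_rank_ckpts is a hypothetical helper that uses only the standard library.

```python
import re
import tempfile
from pathlib import Path


def list_rank_ckpts(dst_dir: str) -> dict:
    """Map rank id -> sorted .ckpt file names found under dst_dir/rank_<id>/."""
    found = {}
    for sub in Path(dst_dir).iterdir():
        match = re.fullmatch(r"rank_(\d+)", sub.name)
        if sub.is_dir() and match:
            found[int(match.group(1))] = sorted(p.name for p in sub.glob("*.ckpt"))
    # Sort numerically so rank_10 comes after rank_2
    return dict(sorted(found.items()))


# Demo on a throwaway directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    for rank in (0, 10, 2):
        rank_dir = Path(tmp) / f"rank_{rank}"
        rank_dir.mkdir()
        (rank_dir / f"dst_ckpt_prefix{rank}.ckpt").touch()
    demo = list_rank_ckpts(tmp)
print(demo)
```

A missing rank id or an empty file list in the result usually means the conversion did not finish for that rank.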

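Scenario 5: Converting distributed weights to a complete ckpt weight

Converting via the safetensors format is still the recommended route.

Method 1: ckpt_to_safetensors + transform_checkpoints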

import mindspore as ms

# Interface with its default values; the arguments are placeholders described below
ms.transform_checkpoints(src_checkpoints_dir, dst_checkpoints_dir, ckpt_prefix, src_strategy_file=None, dst_strategy_file=None, process_num=1, output_format='ckpt')
print("Transform ckpt DONE", flush=True)
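  • src_checkpoints_dir: path of weight A. For distributed weights, point to the directory one level above the rank directories; for a complete weight, point to the file itself
  • dst_checkpoints_dir: path where weight B is generated
  • ckpt_prefix: file name prefix of the generated weight B
  • src_strategy_file: merged strategy file of weight A; pass None for a complete weight
  • dst_strategy_file: merged strategy file of weight B; pass None for a complete weight
  • process_num: number of parallel processes; choose it according to the CPU and memory usage of your machine
  • output_format: "ckpt" or "safetensors"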

Example:

import mindspore as ms

ms.transform_checkpoints(
    "/xxx/ckpt_convert/weight_o/distributed_ckpt",
    "/xxx/ckpt_convert/weight_o/all_ckpt",
    "dst_ckpt_prefix",
    src_strategy_file="src_strategy",
    dst_strategy_file=None,
    process_num=4,
    output_format='ckpt')
print("Transform ckpt DONE", flush=True)

Method 2: ckpt_to_safetensors + unified_safetensors + load_distributed_checkpoint

ckpt_to_safetensors + unified_safetensors produces sharded safetensors files. These can be converted to sharded ckpt files and then merged manually.
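The manual merge is not shown in this post. One possible sketch follows, assuming each sharded ckpt holds complete parameters and that no parameter name appears in more than one shard; if your shards contain slices of the same tensor, this approach does not apply. merge_param_dicts and merge_ckpt_files are hypothetical helpers; only ms.load_checkpoint and ms.save_checkpoint are MindSpore calls.

```python
from typing import Any, Dict, Iterable, List


def merge_param_dicts(shards: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine parameter dicts, refusing names that occur in more than one shard."""
    merged: Dict[str, Any] = {}
    for shard in shards:
        for name, value in shard.items():
            if name in merged:
                raise ValueError(f"parameter {name!r} appears in more than one shard")
            merged[name] = value
    return merged


def merge_ckpt_files(shard_files: List[str], out_file: str) -> None:
    # Lazy import: MindSpore is only needed for the actual file I/O.
    import mindspore as ms
    merged = merge_param_dicts(ms.load_checkpoint(f) for f in shard_files)
    # save_checkpoint accepts a list of {"name": ..., "data": ...} entries
    ms.save_checkpoint([{"name": k, "data": v} for k, v in merged.items()], out_file)
```

The duplicate-name check is deliberate: silently overwriting a parameter would hide the case where the shards are tensor slices rather than whole parameters.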