Deriving the Tensor Shard on Each Card from a MindSpore Layout (Graphical Method)

SuperWZB · June 19, 2025, 12:30

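Recap:

Background: An earlier wiki post showed how to derive the shard number of the Tensor on each card directly in 3 steps: "Deriving the Tensor Shard on Each Card from a MindSpore Layout (List Method)", in the Distributed Parallelism category of the MindSpore forum.
Problem: That method does not show intuitively how a complete Tensor is split, step by step, onto the individual cards.
Solution: This post therefore proposes a visual way to derive Tensor shards, to make it easier to configure sharding strategies with Layout and to improve the usability of the framework.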

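1. Introduction to Layout

For background on Layout, see the following material:

The official mindspore.Layout documentation

1.1 Layout

Layout is the class that MindSpore uses to describe, in a uniform way, how a Tensor is laid out across devices.

It has 3 key member variables:

① device_matrix: the device matrix, which describes how the cards in the cluster are arranged.
② alias_name: the aliases, which name each device dimension of device_matrix.
③ tensor_map: the tensor map, which describes how each dimension of the Tensor is split.

The following code sample is a general template for configuring a sharding strategy for a Tensor: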

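from mindspore import Layout

# Device matrix, assuming 8 cards in total
device_matrix = (2, 4)
# Aliases, named in left-to-right dimension order for convenience
alias_name = ("axis_0", "axis_1")
# Tensor map, assuming Tensor.shape = (2, 4)
tensor_map = ("axis_0", "axis_1")

layout = Layout(device_matrix, alias_name)
# Layout.__call__ takes one argument per Tensor dimension, so unpack the tuple
shard_strategy = layout(*tensor_map)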

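Below we go through the meaning of these 3 variables one by one, using this example.

1.2 device_matrix

The device matrix describes how the cards in the cluster are arranged. A 2-dimensional device matrix is used here to keep things easy to follow.

# Device matrix
device_matrix = (2, 4)

Assuming there are 8 cards in total, device_matrix = (2, 4) arranges these 8 cards into a 2 × 4 matrix.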
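The arrangement can be sketched in plain Python. The snippet assumes that ranks fill the device matrix in row-major order (the rightmost device dimension varies fastest), which agrees with the rank correspondences listed in Section 2.1. `rank_to_coords` is a hypothetical helper written for this illustration, not a MindSpore API.

```python
from math import prod


def rank_to_coords(rank, device_matrix):
    """Convert a rank id to its coordinates in the device matrix (row-major assumption)."""
    coords = []
    for size in reversed(device_matrix):
        coords.append(rank % size)  # coordinate along the current (rightmost remaining) dimension
        rank //= size
    return tuple(reversed(coords))


device_matrix = (2, 4)
for rank in range(prod(device_matrix)):
    print(f"rank{rank} -> {rank_to_coords(rank, device_matrix)}")
```

Under this assumption rank0 to rank3 form the first row of the matrix and rank4 to rank7 the second.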

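1.3 alias_name

The aliases name each device dimension of device_matrix.

# Aliases, named in left-to-right dimension order for convenience
alias_name = ("axis_0", "axis_1")

With alias_name = ("axis_0", "axis_1"), each dimension of the device matrix gets a name: the dimension of size 2 is "axis_0" and the dimension of size 4 is "axis_1".

1.4 tensor_map

The tensor map describes how each dimension of the Tensor is split.

# Tensor map, assuming Tensor.shape = (2, 4)
tensor_map = ("axis_0", "axis_1")

Here len(tensor_map) == len(Tensor.shape). For a given tensor_map, suppose tensor_map[i] = j:

If j = "None", dimension i of the Tensor is not split.
If j = "axis_?", dimension i of the Tensor is split along the "axis_?" dimension of the device matrix.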
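The two rules can be turned into a small calculation of how many parts each Tensor dimension is split into, and of the resulting shard shape. This is a minimal sketch: `split_counts` and `shard_shape` are hypothetical helpers, and the sketch assumes every Tensor dimension is evenly divisible by its split count.

```python
def split_counts(device_matrix, alias_name, tensor_map):
    """Return the number of parts each Tensor dimension is split into."""
    size_of = dict(zip(alias_name, device_matrix))
    # "None" means the dimension is not split, i.e. it stays in 1 part.
    return tuple(1 if name == "None" else size_of[name] for name in tensor_map)


def shard_shape(shape, counts):
    """Return the shape of a single shard (assumes even divisibility)."""
    return tuple(length // parts for length, parts in zip(shape, counts))


counts = split_counts((2, 4), ("axis_0", "axis_1"), ("axis_0", "axis_1"))
print(counts, shard_shape((2, 4), counts))

counts = split_counts((2, 2, 2), ("axis_0", "axis_1", "axis_2"), ("None", "axis_1"))
print(counts, shard_shape((4, 4), counts))
```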

For tensor_map = ("axis_0", "axis_1") in the sample, construct a Tensor with shape = (2, 4). Driven by device_matrix and tensor_map, it is sharded as follows:

First, assume the whole Tensor is placed on rank0.

tensor_map[0] = "axis_0": split dimension 0 of the Tensor along the "axis_0" dimension of device_matrix.

tensor_map[1] = "axis_1": split dimension 1 of the Tensor along the "axis_1" dimension of device_matrix.

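1.5 Tensor sharding result

With the Layout configuration below, dimension 0 of the Tensor is split into 2 parts and dimension 1 into 4 parts, so each of the 8 cards ends up holding one 1 × 1 shard:

# Device matrix, assuming 8 cards in total
device_matrix = (2, 4)
# Aliases, named in left-to-right dimension order for convenience
alias_name = ("axis_0", "axis_1")
# Tensor map, assuming Tensor.shape = (2, 4)
tensor_map = ("axis_0", "axis_1")

This completes the introduction of the 3 key member variables of Layout, and the simplest possible example has shown one full derivation of Tensor shards. Next comes the complete method for deriving Tensor shards from a Layout.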

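2. Derivation method

Deriving Tensor shards from a Layout comes down to 2 steps:

① Split the Tensor from left to right according to tensor_map: going through tensor_map from left to right, split each Tensor dimension along the device_matrix dimension that tensor_map assigns to it.
② Copy the Tensor from right to left according to device_matrix: going through device_matrix from right to left, copy the Tensor shards along each device_matrix dimension that was not used in ①.

This chapter explains the method with a 3-dimensional device matrix and a 2-dimensional Tensor. For more, and more complex, sharding cases, see the samples in Section 3.

# Device matrix, assuming 8 cards in total
device_matrix = (2, 2, 2)
# Aliases, again named in left-to-right dimension order
alias_name = ("axis_0", "axis_1", "axis_2")
# Tensor map, assuming Tensor.shape = (4, 4)
tensor_map = ("axis_2", "axis_0")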

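2.1 Split the Tensor from left to right according to tensor_map

First, arrange the devices into a 3-dimensional 2 * 2 * 2 layout according to device_matrix = (2, 2, 2).

This gives the correspondence between ranks along each device_matrix dimension:

Along the "axis_0" dimension of device_matrix: rank0↔rank4, rank1↔rank5, rank2↔rank6, rank3↔rank7
Along the "axis_1" dimension of device_matrix: rank0↔rank2, rank1↔rank3, rank4↔rank6, rank5↔rank7
Along the "axis_2" dimension of device_matrix: rank0↔rank1, rank2↔rank3, rank4↔rank5, rank6↔rank7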
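These correspondences follow from the stride of each device dimension: two ranks are peers along a dimension when they differ only in that dimension's coordinate. The sketch below reproduces the three lists above, assuming row-major rank order; `axis_groups` is a hypothetical helper.

```python
from math import prod


def axis_groups(device_matrix, axis):
    """Group the ranks that differ only in their coordinate along `axis`."""
    # Stride of a dimension = product of the sizes of all dimensions to its right.
    strides = [prod(device_matrix[i + 1:]) for i in range(len(device_matrix))]
    groups = []
    for rank in range(prod(device_matrix)):
        coord = (rank // strides[axis]) % device_matrix[axis]
        if coord == 0:  # start one group from each rank whose coordinate on `axis` is 0
            groups.append([rank + k * strides[axis] for k in range(device_matrix[axis])])
    return groups


for axis in range(3):
    print(f"axis_{axis}:", axis_groups((2, 2, 2), axis))
```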

Next, initialize a Tensor with shape = (4, 4) and tensor_map = ("axis_2", "axis_0").

First place the Tensor on rank0.

Then split the Tensor from left to right according to tensor_map:

tensor_map[0] = "axis_2": split dimension 0 of the Tensor along the "axis_2" dimension of device_matrix.

tensor_map[1] = "axis_0": split dimension 1 of the Tensor along the "axis_0" dimension of device_matrix.

At this point the Tensor originally on rank0 has been fully split, but rank2, rank3, rank6, and rank7 hold no Tensor shard. This is because the "axis_1" dimension of device_matrix did not take part in the split, so the shards have to be copied along that dimension.
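The split step can be simulated literally: start with the whole Tensor on rank0 and, for each entry of tensor_map, cut every existing piece along the named device dimension and hand the parts to the peer ranks. This is a minimal sketch assuming row-major rank order and evenly divisible dimensions; `split_step` is a hypothetical helper, and each piece is recorded as one (start, stop) index range per Tensor dimension.

```python
from math import prod


def split_step(shape, device_matrix, alias_name, tensor_map):
    """Simulate step 1 only; returns {rank: ((start, stop), ...)} for ranks holding a piece."""
    sizes = dict(zip(alias_name, device_matrix))
    strides = {name: prod(device_matrix[i + 1:]) for i, name in enumerate(alias_name)}
    shards = {0: tuple((0, length) for length in shape)}  # whole Tensor on rank0
    for dim, name in enumerate(tensor_map):  # left to right
        if name == "None":
            continue  # this Tensor dimension is not split
        new_shards = {}
        for rank, region in shards.items():
            start, stop = region[dim]
            step = (stop - start) // sizes[name]
            for k in range(sizes[name]):
                piece = list(region)
                piece[dim] = (start + k * step, start + (k + 1) * step)
                # The k-th part goes to the k-th peer along the named device dimension.
                new_shards[rank + k * strides[name]] = tuple(piece)
        shards = new_shards
    return shards


shards = split_step((4, 4), (2, 2, 2), ("axis_0", "axis_1", "axis_2"), ("axis_2", "axis_0"))
for rank in sorted(shards):
    print(f"rank{rank}: {shards[rank]}")
```

Only rank0, rank1, rank4, and rank5 appear in the result, which matches the observation that rank2, rank3, rank6, and rank7 are still empty after step 1.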

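2.2 Copy the Tensor from right to left according to device_matrix

As noted above, the "axis_1" dimension of device_matrix did not take part in the split, so rank2, rank3, rank6, and rank7 hold no Tensor shard. We therefore copy the Tensor from right to left according to device_matrix (in this example only "axis_1" is unused; for samples with more unused dimensions, see Section 3):

Copy the Tensor shards along the "axis_1" dimension of device_matrix, so that rank2, rank3, rank6, and rank7 receive the shards of rank0, rank1, rank4, and rank5 respectively.

This completes the visual derivation of the Tensor shards for the following Layout:

# Device matrix, assuming 8 cards in total
device_matrix = (2, 2, 2)
# Aliases, again named in left-to-right dimension order
alias_name = ("axis_0", "axis_1", "axis_2")
# Tensor map, assuming Tensor.shape = (4, 4)
tensor_map = ("axis_2", "axis_0")

Note that if every dimension of device_matrix takes part in the split, no copy is needed at all; step 1, "split the Tensor from left to right according to tensor_map", is enough.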
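As a cross-check, the combined result of both steps can also be computed directly for each rank: a rank's coordinate on a used device dimension selects its slice, and its coordinates on unused dimensions are simply ignored, which is what produces the copies. The sketch assumes row-major rank order and evenly divisible dimensions; `shard_of` is a hypothetical helper.

```python
def shard_of(rank, shape, device_matrix, alias_name, tensor_map):
    """Return the (start, stop) index range per Tensor dimension held by `rank`."""
    sizes = dict(zip(alias_name, device_matrix))
    coords = {}
    for name in reversed(alias_name):  # peel off coordinates, rightmost dimension first
        coords[name] = rank % sizes[name]
        rank //= sizes[name]
    region = []
    for length, name in zip(shape, tensor_map):
        parts = 1 if name == "None" else sizes[name]
        index = 0 if name == "None" else coords[name]
        step = length // parts
        region.append((index * step, (index + 1) * step))
    return tuple(region)


names = ("axis_0", "axis_1", "axis_2")
for rank in range(8):
    print(f"rank{rank}: {shard_of(rank, (4, 4), (2, 2, 2), names, ('axis_2', 'axis_0'))}")
```

In the output, rank2, rank3, rank6, and rank7 hold the same index ranges as rank0, rank1, rank4, and rank5, i.e. the copies along "axis_1".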

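To verify the result, we check it with the original list method (Wiki: "Deriving the Tensor shard on each card from dev_matrix and tensor_map"):

First, tabulate the shard number on each rank.

Then split the Tensor in Z order according to the number of splits in each dimension (dimension 0 of the Tensor into 2 parts, dimension 1 into 2 parts), and place the shards by rank number.

The list method gives the same result as the visual derivation, which confirms that this method of visually deriving Tensor shards from a Layout is correct. For more samples, see Section 3.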
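The shard numbers used in this check can be sketched as follows. This is only one reading of the list method, not a reproduction of the earlier wiki: it assumes that "Z order" means numbering the shards in row-major order over the split grid and that ranks fill the device matrix in row-major order. `shard_number` is a hypothetical helper.

```python
def shard_number(rank, device_matrix, alias_name, tensor_map):
    """Row-major (Z-order) index of the shard held by `rank` (assumed numbering)."""
    sizes = dict(zip(alias_name, device_matrix))
    coords = {}
    for name in reversed(alias_name):
        coords[name] = rank % sizes[name]
        rank //= sizes[name]
    number = 0
    for name in tensor_map:
        parts = 1 if name == "None" else sizes[name]
        index = 0 if name == "None" else coords[name]
        number = number * parts + index  # accumulate a row-major index over the split grid
    return number


names = ("axis_0", "axis_1", "axis_2")
numbers = [shard_number(r, (2, 2, 2), names, ("axis_2", "axis_0")) for r in range(8)]
print(numbers)
```

Under these assumptions, ranks that are peers along "axis_1" (rank0 and rank2, for example) receive the same shard number, which is consistent with the copy step in Section 2.2.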

3. Derivation samples

This section lists several common Layout configurations and derives each of them with the method from Section 2.

Sample	Tensor.shape	device_matrix	alias_name	tensor_map	Fully sharded	Split left to right by tensor_map	Copy right to left by device_matrix
3.1	(4, 4)	(2, 4)	("axis_0", "axis_1")	("axis_1", "axis_0")	Yes	① "axis_1" splits Tensor dim 0; ② "axis_0" splits Tensor dim 1	Fully sharded, no copy needed
3.2	(4, 4)	(4, 2)	("axis_0", "axis_1")	("axis_1", "None")	No	① "axis_1" splits Tensor dim 0; ② "None", no split	① copy along "axis_0"
3.3	(4, 4)	(2, 1, 4)	("axis_0", "axis_1", "axis_2")	("axis_0", "axis_2")	Yes	① "axis_0" splits Tensor dim 0; ② "axis_2" splits Tensor dim 1	Fully sharded, no copy needed
3.4	(4, 4)	(2, 2, 2)	("axis_0", "axis_1", "axis_2")	("None", "axis_1")	No	① "None", no split; ② "axis_1" splits Tensor dim 1	① copy along "axis_2"; ② copy along "axis_0"
3.5	(2, 2, 4)	(2, 2, 2)	("axis_0", "axis_1", "axis_2")	("axis_1", "axis_0", "axis_2")	Yes	① "axis_1" splits Tensor dim 0; ② "axis_0" splits Tensor dim 1; ③ "axis_2" splits Tensor dim 2	Fully sharded, no copy needed
3.6	(2, 2, 4)	(2, 2, 2)	("axis_0", "axis_1", "axis_2")	("None", "axis_0", "None")	No	① "None", no split; ② "axis_0" splits Tensor dim 1; ③ "None", no split	① copy along "axis_2"; ② copy along "axis_1"
3.7	(2, 2, 2, 2)	(2, 2, 2, 2)	("axis_0", "axis_1", "axis_2", "axis_3")	("None", "axis_2", "axis_1", "None")	No	① "None", no split; ② "axis_2" splits Tensor dim 1; ③ "axis_1" splits Tensor dim 2; ④ "None", no split	① copy along "axis_3"; ② copy along "axis_0"
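The last columns of the table can be recomputed mechanically: the device dimensions to copy along are those that do not appear in tensor_map, visited from right to left. The sketch below skips dimensions of size 1, since copying along them changes nothing; `copy_axes` is a hypothetical helper, and the tensor_map values are the ones used in Sections 3.1 to 3.7.

```python
def copy_axes(device_matrix, alias_name, tensor_map):
    """Unused device dimensions in right-to-left order, ignoring dimensions of size 1."""
    used = set(tensor_map)
    pairs = list(zip(alias_name, device_matrix))
    return [name for name, size in reversed(pairs) if name not in used and size > 1]


samples = {
    "3.1": ((2, 4), ("axis_1", "axis_0")),
    "3.2": ((4, 2), ("axis_1", "None")),
    "3.3": ((2, 1, 4), ("axis_0", "axis_2")),
    "3.4": ((2, 2, 2), ("None", "axis_1")),
    "3.5": ((2, 2, 2), ("axis_1", "axis_0", "axis_2")),
    "3.6": ((2, 2, 2), ("None", "axis_0", "None")),
    "3.7": ((2, 2, 2, 2), ("None", "axis_2", "axis_1", "None")),
}
for key, (device_matrix, tensor_map) in samples.items():
    alias_name = tuple(f"axis_{i}" for i in range(len(device_matrix)))
    axes = copy_axes(device_matrix, alias_name, tensor_map)
    print(key, "fully sharded" if not axes else f"copy along {axes}")
```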

3.1 2-D Tensor + 2-D device_matrix + 8 cards, fully sharded

tensor.shape = (4, 4)
device_matrix = (2, 4)
alias_name = ("axis_0", "axis_1")
tensor_map = ("axis_1", "axis_0")

3.1.1 Split the Tensor from left to right according to tensor_map

Initialize the Tensor and the device matrix.

Place the Tensor on rank0.

tensor_map[0] = "axis_1": split dimension 0 of the Tensor along the "axis_1" dimension of device_matrix.

tensor_map[1] = "axis_0": split dimension 1 of the Tensor along the "axis_0" dimension of device_matrix.

3.1.2 Copy the Tensor from right to left according to device_matrix

Every dimension of device_matrix took part in the split and all devices hold a distinct shard, so no copy is needed. The derivation is complete.

Verification with the list method: tabulate the shard number on each rank, then split the Tensor in Z order and place the shards by rank number.

3.2 2-D Tensor + 2-D device_matrix + 8 cards, not fully sharded

tensor.shape = (4, 4)
device_matrix = (4, 2)
alias_name = ("axis_0", "axis_1")
tensor_map = ("axis_1", "None")

3.2.1 Split the Tensor from left to right according to tensor_map

Initialize the Tensor and the device matrix.

Place the Tensor on rank0.

tensor_map[0] = "axis_1": split dimension 0 of the Tensor along the "axis_1" dimension of device_matrix.

tensor_map[1] = "None": dimension 1 of the Tensor is not split; skip it.

3.2.2 Copy the Tensor from right to left according to device_matrix

The "axis_0" dimension did not take part in the split: copy the Tensor shards along the "axis_0" dimension of device_matrix.

The derivation is complete.

Verification with the list method: tabulate the shard number on each rank, then split the Tensor in Z order and place the shards by rank number.

3.3 2-D Tensor + 3-D device_matrix + 8 cards, fully sharded

tensor.shape = (4, 4)
device_matrix = (2, 1, 4)
alias_name = ("axis_0", "axis_1", "axis_2")
tensor_map = ("axis_0", "axis_2")

3.3.1 Split the Tensor from left to right according to tensor_map

Initialize the Tensor and the device matrix.

Place the Tensor on rank0.

tensor_map[0] = "axis_0": split dimension 0 of the Tensor along the "axis_0" dimension of device_matrix.

tensor_map[1] = "axis_2": split dimension 1 of the Tensor along the "axis_2" dimension of device_matrix.

3.3.2 Copy the Tensor from right to left according to device_matrix

The "axis_1" dimension of device_matrix did not take part in the split, but its size is 1 and all 8 cards already hold a distinct shard, so no copy is needed. The derivation is complete.
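The reasoning about the size-1 dimension can be checked numerically: the number of cards that hold each distinct shard equals the product of the sizes of the unused device dimensions, and a dimension of size 1 contributes a factor of 1. `replica_count` is a hypothetical helper used only for this illustration.

```python
from math import prod


def replica_count(device_matrix, alias_name, tensor_map):
    """Number of cards holding each distinct shard."""
    used = set(tensor_map)
    # math.prod of an empty sequence is 1, which covers the fully sharded case.
    return prod(size for name, size in zip(alias_name, device_matrix) if name not in used)


names = ("axis_0", "axis_1", "axis_2")
print(replica_count((2, 1, 4), names, ("axis_0", "axis_2")))  # sample 3.3
print(replica_count((2, 2, 2), names, ("None", "axis_1")))    # sample 3.4
```

For sample 3.3 the count is 1, so every shard lives on exactly one card. For sample 3.4 it is 4, so each of the 2 shards ends up on 4 cards.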

Verification with the list method: tabulate the shard number on each rank, then split the Tensor in Z order and place the shards by rank number.

3.4 2-D Tensor + 3-D device_matrix + 8 cards, not fully sharded

tensor.shape = (4, 4)
device_matrix = (2, 2, 2)
alias_name = ("axis_0", "axis_1", "axis_2")
tensor_map = ("None", "axis_1")

3.4.1 Split the Tensor from left to right according to tensor_map

Initialize the Tensor and the device matrix.

Place the Tensor on rank0.

tensor_map[0] = "None": dimension 0 of the Tensor is not split; skip it.
tensor_map[1] = "axis_1": split dimension 1 of the Tensor along the "axis_1" dimension of device_matrix.

3.4.2 Copy the Tensor from right to left according to device_matrix

In this sample the 2 dimensions "axis_0" and "axis_2" did not take part in the split. Copy in right-to-left order: "axis_2" -> "axis_0".

The "axis_2" dimension did not take part in the split: copy the Tensor shards along the "axis_2" dimension of device_matrix.

The "axis_0" dimension did not take part in the split: copy the Tensor shards along the "axis_0" dimension of device_matrix.

The derivation is complete.

Verification with the list method: tabulate the shard number on each rank, then split the Tensor in Z order and place the shards by rank number.

3.5 3-D Tensor + 3-D device_matrix + 8 cards, fully sharded

tensor.shape = (2, 2, 4)
device_matrix = (2, 2, 2)
alias_name = ("axis_0", "axis_1", "axis_2")
tensor_map = ("axis_1", "axis_0", "axis_2")

3.5.1 Split the Tensor from left to right according to tensor_map

Initialize the Tensor and the device matrix.

Place the Tensor on rank0.

tensor_map[0] = "axis_1": split dimension 0 of the Tensor along the "axis_1" dimension of device_matrix.

tensor_map[1] = "axis_0": split dimension 1 of the Tensor along the "axis_0" dimension of device_matrix.

tensor_map[2] = "axis_2": split dimension 2 of the Tensor along the "axis_2" dimension of device_matrix.

3.5.2 Copy the Tensor from right to left according to device_matrix

Every dimension of device_matrix took part in the split and all devices hold a distinct shard, so no copy is needed. The derivation is complete.

Verification with the list method: tabulate the shard number on each rank, then split the Tensor in Z order and place the shards by rank number.

3.6 3-D Tensor + 3-D device_matrix + 8 cards, not fully sharded

tensor.shape = (2, 2, 4)
device_matrix = (2, 2, 2)
alias_name = ("axis_0", "axis_1", "axis_2")
tensor_map = ("None", "axis_0", "None")

3.6.1 Split the Tensor from left to right according to tensor_map

Initialize the Tensor and the device matrix.

Place the Tensor on rank0.

tensor_map[0] = "None": dimension 0 of the Tensor is not split; skip it.
tensor_map[1] = "axis_0": split dimension 1 of the Tensor along the "axis_0" dimension of device_matrix.

tensor_map[2] = "None": dimension 2 of the Tensor is not split; skip it.

3.6.2 Copy the Tensor from right to left according to device_matrix

In this sample the 2 dimensions "axis_1" and "axis_2" did not take part in the split. Copy in right-to-left order: "axis_2" -> "axis_1".

The "axis_2" dimension did not take part in the split: copy the Tensor shards along the "axis_2" dimension of device_matrix.

The "axis_1" dimension did not take part in the split: copy the Tensor shards along the "axis_1" dimension of device_matrix.

The derivation is complete.

Verification with the list method: tabulate the shard number on each rank, then split the Tensor in Z order and place the shards by rank number.

3.7 4-D Tensor + 4-D device_matrix + 16 cards, not fully sharded

tensor.shape = (2, 2, 2, 2)
device_matrix = (2, 2, 2, 2)
alias_name = ("axis_0", "axis_1", "axis_2", "axis_3")
tensor_map = ("None", "axis_2", "axis_1", "None")

3.7.1 Split the Tensor from left to right according to tensor_map

Initialize the Tensor and the device matrix.

Place the Tensor on rank0 (note: because the rank0 box in the drawing is small, the 2 sub-tensors along dim_0 are drawn side by side).

tensor_map[0] = "None": dimension 0 of the Tensor is not split; skip it.
tensor_map[1] = "axis_2": split dimension 1 of the Tensor along the "axis_2" dimension of device_matrix.

tensor_map[2] = "axis_1": split dimension 2 of the Tensor along the "axis_1" dimension of device_matrix.

tensor_map[3] = "None": dimension 3 of the Tensor is not split; skip it.

3.7.2 Copy the Tensor from right to left according to device_matrix

In this sample the 2 dimensions "axis_0" and "axis_3" did not take part in the split. Copy in right-to-left order: "axis_3" -> "axis_0".

The "axis_3" dimension did not take part in the split: copy the Tensor shards along the "axis_3" dimension of device_matrix.

The "axis_0" dimension did not take part in the split: copy the Tensor shards along the "axis_0" dimension of device_matrix.

The derivation is complete.
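With a (2, 2, 2, 2) device matrix there are 16 ranks, so it helps to list which ranks end up with identical shards. The sketch below groups ranks by the index ranges they hold, assuming row-major rank order and evenly divisible dimensions; `group_by_shard` is a hypothetical helper.

```python
from collections import defaultdict
from math import prod


def group_by_shard(shape, device_matrix, alias_name, tensor_map):
    """Map each distinct shard (index ranges per Tensor dimension) to the ranks holding it."""
    sizes = dict(zip(alias_name, device_matrix))
    groups = defaultdict(list)
    for rank in range(prod(device_matrix)):
        rest, coords = rank, {}
        for name in reversed(alias_name):
            coords[name] = rest % sizes[name]
            rest //= sizes[name]
        region = []
        for length, name in zip(shape, tensor_map):
            if name == "None":
                region.append((0, length))  # unsplit dimension: the full range
            else:
                step = length // sizes[name]
                region.append((coords[name] * step, (coords[name] + 1) * step))
        groups[tuple(region)].append(rank)
    return dict(groups)


names = ("axis_0", "axis_1", "axis_2", "axis_3")
groups = group_by_shard((2, 2, 2, 2), (2, 2, 2, 2), names, ("None", "axis_2", "axis_1", "None"))
for region, ranks in groups.items():
    print(region, "->", ranks)
```

For this sample the result contains 4 distinct shards, each held by 4 ranks: 2 copies along "axis_3" times 2 copies along "axis_0".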

Verification with the list method: tabulate the shard number on each rank, then split the Tensor in Z order and place the shards by rank number.
