1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend/GPU/CPU
MindSpore version: 2.1
Execution mode (PyNative/Graph): either
2 Error Information
2.1 Problem Description
In autoregressive inference without a KV cache, each generated token is concatenated to the end of the current input, and the whole sequence is fed back in as the next input. In MindSpore this mode is enabled by setting use_past to False. In Hugging Face Transformers, the use_cache option in the model's config.json indicates whether the KV cache is enabled, but setting use_cache to false in config.json alone does not take effect.
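The concatenation loop described above can be sketched as follows. This is a toy illustration, not MindSpore or Transformers code: `next_token_logits` is a hypothetical stand-in for a real model forward pass.

```python
# Toy sketch of greedy autoregressive decoding WITHOUT a KV cache:
# every step re-runs the forward pass on the FULL sequence, and the
# newly generated token is appended to the input for the next step.

def next_token_logits(tokens):
    # Hypothetical "model": predicts (last token + 1) mod 5.
    target = (tokens[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]

def generate_no_cache(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # full-sequence forward pass each step
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_token)           # concatenate for the next iteration
    return tokens

print(generate_no_cache([0], 4))  # [0, 1, 2, 3, 4]
```

With a KV cache enabled, each step would instead feed only the newest token and reuse the cached key/value states; disabling the cache forces the full-sequence recomputation shown here.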
{
    "_name_or_path": "bigcode/starcoder",
    "activation_function": "gelu",
    "architectures": [
        "GPTBigCodeForCausalLM"
    ],
    "attention_softmax_in_fp32": true,
    "attn_pdrop": 0.1,
    "bos_token_id": 0,
    "embd_pdrop": 0.1,
    "eos_token_id": 0,
    "inference_runner": 0,
    "initializer_range": 0.02,
    "layer_norm_epsilon": 1e-05,
    "max_batch_size": null,
    "max_sequence_length": null,
    "model_type": "gpt_bigcode",
    "multi_query": true,
    "n_embd": 6144,
    "n_head": 48,
    "n_inner": 24576,
    "n_layer": 40,
    "n_positions": 8192,
    "pad_key_length": true,
    "pre_allocate_kv_cache": false,
    "resid_pdrop": 0.1,
    "scale_attention_softmax_in_fp32": false,
    "scale_attn_weights": true,
    "summary_activation": null,
    "summary_first_dropout": 0.1,
    "summary_proj_to_labels": true,
    "summary_type": "cls_index",
    "summary_use_proj": true,
    "torch_dtype": "float16",
    "transformers_version": "4.30.0.dev0",
    "use_cache": true,
    "validate_runner_input": true,
    "vocab_size": 49153
}
3 Solution
use_cache=False must also be passed to model.generate:
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_new_tokens,
        output_hidden_states=True,
        use_cache=False,
    )
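The reason this works is that keyword arguments passed to generate() at call time take precedence over the defaults loaded from config.json. The sketch below illustrates that precedence with plain dicts; it is not the actual Transformers internals, and the names `config_defaults` and `resolve_generation_options` are made up for illustration.

```python
# Illustrative only: call-time kwargs override config-file defaults,
# which is why use_cache=False must be passed to generate() itself.

config_defaults = {"use_cache": True, "max_new_tokens": 20}  # stands in for config.json

def resolve_generation_options(**call_kwargs):
    # Later dict entries win, so call-time kwargs shadow the defaults.
    return {**config_defaults, **call_kwargs}

opts = resolve_generation_options(use_cache=False)
print(opts["use_cache"])  # False
```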