Slow LLaMA2-7B inference on a single Ascend 910 card

1. System environment

Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2
Execution mode (PyNative/Graph): Graph

2. Error information

Running LLaMA2-7B inference on a single card reaches only 3.75 tokens per second.
The device is an Ascend 910, with MindSpore==2.2.11 and mindformers==1.0.0.

3. Solution

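Enable incremental inference. With it enabled, inference speed returns to normal (about 24 tokens/s in the log below):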
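To illustrate why incremental inference helps, here is a minimal sketch that counts how many token positions the model has to process during generation, with and without cached key/value states. It is a toy cost model, not MindFormers code: the function name `positions_processed` and the prompt length of 16 are hypothetical, and the counts do not predict wall-clock time. In MindFormers, incremental inference is usually switched on through the `use_past` option in the model config; treat that as an assumption and check it against the documentation for your version.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model while generating new_tokens."""
    if use_cache:
        # Incremental inference: the prompt is processed once (prefill),
        # then each later step feeds only the newest token.
        return prompt_len + (new_tokens - 1)
    # Full recomputation: step i re-runs the prompt plus the i tokens
    # generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


# Hypothetical prompt length; 504 matches the generated-token count in the log.
prompt_len, new_tokens = 16, 504
full = positions_processed(prompt_len, new_tokens, use_cache=False)
incremental = positions_processed(prompt_len, new_tokens, use_cache=True)
print(f"full recomputation: {full} positions")
print(f"incremental:        {incremental} positions")
```

Without the cache the work grows quadratically with the number of generated tokens, while with the cache it grows linearly, which is why long generations benefit most.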

INFO:  10.254.0.1:13452 - "POST /generate HTTP/1.1" 200 OK
2024-04-23 03:55:32,127 - mindformers[mindformers/generation/text_generator.py:478] - INFO - total time: 21.01030683517456 s; generated tokens: 504 tokens; generate speed: 23.988226538235256 tokens/s
2024-04-23 03:55:32,142 - mindformers[mindformers-1.0.0/chat_web/predict_process.py:122] - INFO - 0 card generate output: it is a city that has everything. It is a city of contrasts, where you can find the most modern skyscrapers and the most ancient temples in the same street.
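As a quick check of the numbers in that log, the following sketch parses the generation statistics and compares the throughput with the 3.75 tokens/s reported before the fix. The helper `parse_generate_stats` is a hypothetical name written for this example and is not part of MindFormers; the regular expression assumes the log wording shown above.

```python
import re

# Statistics portion of the text_generator.py log line quoted above.
LOG_LINE = (
    "total time: 21.01030683517456 s; generated tokens: 504 tokens; "
    "generate speed: 23.988226538235256 tokens/s"
)


def parse_generate_stats(line):
    """Return (seconds, tokens) extracted from a generation log line."""
    match = re.search(
        r"total time: ([\d.]+) s; generated tokens: (\d+) tokens", line
    )
    if match is None:
        raise ValueError("no generation statistics found in line")
    return float(match.group(1)), int(match.group(2))


seconds, tokens = parse_generate_stats(LOG_LINE)
speed = tokens / seconds   # tokens per second after enabling incremental inference
speedup = speed / 3.75     # relative to the speed reported before the fix
print(f"{speed:.2f} tokens/s, {speedup:.1f}x faster than before")
```

The recomputed throughput agrees with the `generate speed` value in the log and is roughly 6.4 times the original speed.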

You could also try mindnlp: the latest version, 0.5.0rc2, can run the original transformers code directly.