1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.2
Execution mode (PyNative/Graph): Graph
2. Error message
Single-card LLaMA2-7B inference runs at only 3.75 tokens per second.
The device is an Ascend 910, with MindSpore==2.2.11 and mindformers==1.0.0.
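3. Solution
Enable incremental inference. With it enabled, inference speed returns to normal: the log from a run after the change reports about 24 tokens/s, up from 3.75 tokens/s.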
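Why incremental inference helps can be shown with a rough cost model. The sketch below only counts how many token positions the model has to process per generation step; it does not call MindSpore or MindFormers, the function names are made up for illustration, and the prompt length of 16 is an assumed value (the post does not give it). In MindFormers, incremental inference is, to my knowledge, switched on with `use_past: True` in the model config; please check that key against the docs for your version.

```python
def token_passes_full(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when every step re-runs the whole sequence."""
    # Step i (0-based) feeds the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


def token_passes_incremental(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when keys/values of earlier tokens are cached."""
    if new_tokens == 0:
        return 0
    # The first step (prefill) processes the whole prompt once;
    # every later step feeds only the single most recent token.
    return prompt_len + (new_tokens - 1)


if __name__ == "__main__":
    prompt_len = 16   # assumed value, not taken from the post
    new_tokens = 504  # generated token count reported in the log
    full = token_passes_full(prompt_len, new_tokens)
    incremental = token_passes_incremental(prompt_len, new_tokens)
    print(f"full recompute: {full} token positions")
    print(f"incremental:    {incremental} token positions")
```

This counts input positions only. Each incremental step still attends over the cached keys and values, so the wall-clock gain is much smaller than the ratio of the two counts.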
INFO: 10.254.0.1:13452 - "POST /generate HTTP/1.1" 200 OK
2024-04-23 03:55:32,127 - mindformers[mindformers/generation/text_generator.py:478] - INFO - total time: 21.01030683517456 s; generated tokens: 504 tokens; generate speed: 23.988226538235256 tokens/s
2024-04-23 03:55:32,142 - mindformers[mindformers-1.0.0/chat_web/predict_process.py:122] - INFO - 0 card generate output: it is a city that has everything. It is a city of contrasts, where you can find the most modern skyscrapers and the most ancient temples in the same street.
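To double-check the numbers, the throughput can be recomputed from the `text_generator.py` log line and compared with the 3.75 tokens/s seen before the change. This is a small sketch using only the standard library; the regular expression matches the log format shown above and may need adjusting for other MindFormers versions, and `parse_generation_log` is a hypothetical helper name.

```python
import re

# Log line copied from the run with incremental inference enabled.
LOG_LINE = (
    "2024-04-23 03:55:32,127 - mindformers[mindformers/generation/"
    "text_generator.py:478] - INFO - total time: 21.01030683517456 s; "
    "generated tokens: 504 tokens; generate speed: 23.988226538235256 tokens/s"
)

BASELINE_TOKENS_PER_S = 3.75  # speed reported before enabling incremental inference

PATTERN = re.compile(
    r"total time: (?P<seconds>[\d.]+) s; "
    r"generated tokens: (?P<tokens>\d+) tokens; "
    r"generate speed: (?P<speed>[\d.]+) tokens/s"
)


def parse_generation_log(line: str) -> tuple[float, int, float]:
    """Return (seconds, tokens, logged tokens/s) from one generation log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a generation summary")
    return (
        float(match["seconds"]),
        int(match["tokens"]),
        float(match["speed"]),
    )


seconds, tokens, logged_speed = parse_generation_log(LOG_LINE)
recomputed_speed = tokens / seconds
speedup = recomputed_speed / BASELINE_TOKENS_PER_S

print(f"logged speed:     {logged_speed:.2f} tokens/s")
print(f"recomputed speed: {recomputed_speed:.2f} tokens/s")
print(f"speedup vs. baseline: {speedup:.1f}x")
```

The recomputed value agrees with the logged one, and the run comes out roughly 6.4 times faster than the 3.75 tokens/s baseline.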