Deploying the Yi-34B-Chat-4bits Model

Introduction

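I recently tried a number of open-source large language models, and after comparing their output I found that the 4-bit quantized Yi-34B-Chat-4bits performed best. This post briefly summarizes what I learned about deploying it, along with some problems I ran into.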

Downloading the Model

If you have unrestricted access to huggingface.co, this is straightforward: after installing git-lfs, clone the model repository directly.

git clone https://huggingface.co/01-ai/Yi-34B-Chat-4bits

If you do not, you can use the mirror repository instead.

git clone https://hf-mirror.com/01-ai/Yi-34B-Chat-4bits

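The mirror appears to be rate limited, though: when downloading several models it tends to stall, and you have to wait a while before it resumes. A plain clone also pulls down a lot of historical versions, so I used a script to download the model files one by one and then used git clone for the remaining small files.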
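The per-file download described above could look roughly like the following sketch. It is not the author's script: the entries in `FILES` are hypothetical placeholders (use the file names listed on the model page), and the helper names `build_url`, `download`, and `main` are made up for illustration. It assumes the mirror follows the same `/resolve/<revision>/<filename>` URL layout as huggingface.co.

```python
import os
import urllib.request

MIRROR = "https://hf-mirror.com"  # mirror host from the article
REPO_ID = "01-ai/Yi-34B-Chat-4bits"
# Hypothetical file names: replace them with the large files listed on the model page.
FILES = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]


def build_url(filename, repo_id=REPO_ID, endpoint=MIRROR, revision="main"):
    """Return the direct download URL of one file in a model repository."""
    return f"{endpoint}/{repo_id}/resolve/{revision}/{filename}"


def download(filename, target_dir="Yi-34B-Chat-4bits"):
    """Download one file, skipping it if it is already present."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, filename)
    if os.path.exists(target):
        return target
    # Write to a temporary name first so that an interrupted
    # download is never mistaken for a complete file.
    partial = target + ".part"
    urllib.request.urlretrieve(build_url(filename), partial)
    os.replace(partial, target)
    return target


def main():
    """Fetch every file in FILES, one at a time."""
    for name in FILES:
        print("downloaded", download(name))
```

Calling `main()` fetches the files sequentially, so a stalled transfer can be retried without repeating the files that already finished. One way to get the remaining small files is to clone with the git-lfs setting `GIT_LFS_SKIP_SMUDGE=1`, which skips the large objects, into a separate directory and copy them across.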

Environment Setup

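Because this is the 4-bit quantized version, the key requirement is the autoawq quantization package. Without it, loading the model is extremely slow (about 20 minutes), and the load then fails with an error, so the model cannot be used. Pay attention to the transformers and PyTorch versions as well: I used transformers==4.35.2 and torch==2.0.1+cu117. After that upgrade, however, loading 8-bit models and baichuan2 started failing, so I rolled transformers back to 4.33; if you don't mind the overhead, you can keep two separate conda environments. A plain pip install directs you to install from a release; if you are on CUDA 12 or later, you can use it directly. Otherwise, install a wheel from the release page, for example: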

pip install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.7/autoawq-0.1.7+cu118-cp310-cp310-linux_x86_64.whl

The remaining packages can be installed by following the dependency list in the official Git repository.
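Since version mismatches caused most of the trouble described above, a quick check of what is actually installed can save a slow failed load. The sketch below uses only the standard library; the `EXPECTED` table simply records the versions mentioned in this post (treat it as one combination that worked, not a requirement), and the function names are illustrative.

```python
from importlib import metadata

# Versions reported in this post: one combination that worked, not a hard requirement.
EXPECTED = {
    "transformers": "4.35.2",
    "torch": "2.0.1+cu117",
    "autoawq": "0.1.7+cu118",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected=EXPECTED):
    """Build one status line per package comparing installed and expected versions."""
    lines = []
    for name, version in installed_versions(expected).items():
        if version is None:
            status = "MISSING"
        elif version == expected[name]:
            status = "ok"
        else:
            status = f"differs (this post used {expected[name]})"
        lines.append(f"{name}: {version} {status}")
    return lines
```

Running `print("\n".join(report()))` inside each conda environment shows at a glance which one holds the 4.35.2 setup and which one holds the rolled-back transformers.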

Model Service

Service Deployment

The web service is built with FastAPI. The code is as follows:

from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel

from transformers import AutoModelForCausalLM, AutoTokenizer
import uvicorn
import os
import time


# Expose GPUs 0 and 1 to this process (comma-separated, no spaces)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
app = FastAPI()


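class RequestItem(BaseModel):
    """Request body: the user message plus optional decoding parameters."""
    message: str
    temperature: float = 0.8
    max_length: int = 3000  # a token count, so an integer rather than a float
    repetition_penalty: float = 1.2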


@app.post("/v1/completions")
async def generate_text(request_item: RequestItem):
    """generate_text"""
    try:
        # Handle the incoming JSON request here
        content = request_item.message
        messages = [
            {"role": "user", "content": content}
        ]
        model.generation_config.temperature = request_item.temperature
        model.generation_config.max_length = request_item.max_length
        model.generation_config.repetition_penalty = request_item.repetition_penalty

        input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt')
        output_ids = model.generate(input_ids.to('cuda'))
        response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
        return response

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    model_path = ""  # Set this to the local path of the downloaded model
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True
    )

    print("start to load model")
    start = time.time()
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        # use_flash_attention_2=True,
        torch_dtype='auto'
    )
    model = model.eval()
    end = time.time()
    print(f"load model spend time: {end-start}")
    uvicorn.run(app, host='0.0.0.0', port=8891, workers=1)
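The handler above writes each request's decoding parameters into `model.generation_config`, which is shared by the whole process. A possible alternative, sketched below, keeps the settings local to each call by building a keyword dictionary and passing it to `model.generate`. The helper name `build_generate_kwargs` and the validation rules are illustrative choices, not part of the original service.

```python
def build_generate_kwargs(temperature=0.8, max_length=3000, repetition_penalty=1.2):
    """Validate decoding parameters and return them as keyword arguments."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    if repetition_penalty <= 0:
        raise ValueError("repetition_penalty must be positive")
    return {
        "temperature": float(temperature),
        "max_length": int(max_length),  # a token count, so force an integer
        "repetition_penalty": float(repetition_penalty),
    }


# Inside the handler, the three generation_config assignments could then become:
#   kwargs = build_generate_kwargs(request_item.temperature,
#                                  request_item.max_length,
#                                  request_item.repetition_penalty)
#   output_ids = model.generate(input_ids.to('cuda'), **kwargs)
```

Rejecting bad values before generation starts also means the client gets a short, readable error instead of a failure raised from deep inside the model call.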

Calling the Service

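import requests


def get_from_llm(content):
    """Fetch a completion from the LLM service."""
    url = "http://localhost:8891/v1/completions"
    # Decoding parameters can be added here; any that are omitted use the service defaults.
    messages = {"message": content}
    res = requests.post(url=url, json=messages).json()
    return res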
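To illustrate the remark about decoding parameters, the sketch below sends optional `temperature`, `max_length`, and `repetition_penalty` fields along with the message. It uses the standard library's `urllib` so that it runs without extra packages; `requests.post(url, json=payload)` as in the client above would work equally well. The `build_payload` helper and the `timeout` default are assumptions for illustration.

```python
import json
import urllib.request

URL = "http://localhost:8891/v1/completions"  # endpoint of the service above


def build_payload(content, **decoding):
    """Build the request body; only the fields defined by RequestItem are accepted."""
    allowed = {"temperature", "max_length", "repetition_penalty"}
    unknown = set(decoding) - allowed
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    return {"message": content, **decoding}


def get_from_llm(content, timeout=300, **decoding):
    """POST the payload and return the decoded JSON response."""
    data = json.dumps(build_payload(content, **decoding)).encode("utf-8")
    request = urllib.request.Request(
        URL, data=data, headers={"Content-Type": "application/json"}
    )
    # Long generations can take a while, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `get_from_llm("Hello", temperature=0.3, max_length=512)` overrides two defaults and leaves `repetition_penalty` at the service's value of 1.2.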

Streamlit Web Deployment

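A web page is a better way to show what the model can do. I built this one with the Streamlit framework, but you can use any framework you like. Only single-turn dialogue is implemented here; multi-turn dialogue and streaming replies are not yet supported.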

import json
import requests
import streamlit as st
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"


st.set_page_config(page_title="Yi-34B-Chat-4bits")
st.title("Yi-34B-Chat-4bits")


@st.cache_resource
def init_model():
    """
    Initialize the model and tokenizer.
    """
    model_path = ""  # Set this to your own model path
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype='auto'
    )
    return model, tokenizer


st.markdown("## 📌 Single-turn chat")

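def get_from_llm(content):
    """Fetch a completion from the LLM service endpoint."""
    url = "http://localhost:8891/v1/completions"
    # Decoding parameters can be added here; any that are omitted use the service defaults.
    messages = {"message": content}
    res = requests.post(url=url, json=messages).json()
    return res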

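def text_generate():
    """Text generation."""
    # Avoid naming the variable `input`, which would shadow the built-in function
    user_input = st.text_input("Enter an instruction")
    if user_input:
        result = get_from_llm(user_input)
        st.markdown(result)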
    

if __name__ == "__main__":
    text_generate()
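The page above passes whatever the service returns straight to `st.markdown`. On success that is a plain string, but when the FastAPI handler raises `HTTPException` the body is a JSON object with a `detail` field. A small helper like the one below, whose name `render_reply` is made up for illustration, could turn both cases into displayable text.

```python
def render_reply(status_code, body):
    """Turn a service response into text that is safe to show on the page."""
    if status_code == 200 and isinstance(body, str):
        return body
    # FastAPI error responses carry their message under the "detail" key.
    detail = body.get("detail") if isinstance(body, dict) else body
    return f"Request failed (HTTP {status_code}): {detail}"
```

To use it, `get_from_llm` would need to keep the response object so that both `response.status_code` and `response.json()` are available. Assuming the page is saved as `app.py` (a hypothetical file name), it is started with `streamlit run app.py` while the FastAPI service is running.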