
Hello, friends!

Load Your LLM model after Fine-tuning

Recently, I worked on an LLM fine-tuning project. After fine-tuning, we actually end up with a model in two parts:

  • llama2-7b-hf (the original base model)
  • LoRA adapter files (PEFT)

If we simply merge the two together, the result is around 26 GB, which is really large and slow to run. At the same time, every time we want to interact with the model we have to set the padding token and convert the prompt into torch tensors, which is not convenient for an end user. So here we want to deploy our LLM in a way that lets the end user interact with it quickly and easily.
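
To make the pain point concrete, this is roughly what every interaction looks like without a deployment layer. This is a minimal sketch using the standard transformers API; the folder names merged and token refer to the merged model and tokenizer we will save in Step 1, and the prompt is just a placeholder:

# interact_raw.py - manual interaction without any deployment layer
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("merged", device_map="auto")  # merged model from Step 1
tokenizer = AutoTokenizer.from_pretrained("token")                         # tokenizer saved in Step 1

prompt = "What is the outlook for tech stocks this quarter?"               # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)  # prompt -> torch tensors

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))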

Step 1

We first need to merge these two parts together. In lora-gpt.py we load the base model and the LoRA adapter, merge them, and save the merged model and tokenizer.

# lora-gpt.py
from torch import cuda
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

hf_auth = "<your hf-auth>"                     # Hugging Face access token
model_id = 'meta-llama/Llama-2-7b-hf'          # base model
peft_model_id = 'Andy1124233/capstone_fingpt'  # LoRA adapter trained with PEFT

# Load the base model. Note: no quantization_config here; we load the
# full-precision weights so the LoRA deltas can be merged and saved cleanly.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map='auto',
    token=hf_auth,
    offload_folder="/kaggle/working/"
)
model.model_parallel = True

# Load the tokenizer and give it a dedicated padding token if it lacks one
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, token=hf_auth)
tokenizer.padding_side = "left"

if not tokenizer.pad_token or tokenizer.pad_token_id == tokenizer.eos_token_id:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

model.config.use_cache = False

# Attach the LoRA adapter, fold its weights into the base model, and save
model = PeftModel.from_pretrained(model, peft_model_id, offload_folder="/kaggle/working/")
model = model.eval()

model = model.merge_and_unload()
model.save_pretrained("merged")
tokenizer.save_pretrained("token")

If you use the AutoDL platform to load the model, remember to point the Hugging Face cache to the autodl-tmp directory and also enable the “学术加速” (academic network acceleration) option:


export HF_HOME=/root/autodl-tmp/cache/
source /etc/network_turbo

After running the script, you will have two folders (merged and token). The merged folder contains the safetensors files; move the tokenizer-related files into the merged folder.
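
If you would rather do this in Python than by hand, here is a minimal sketch, assuming the folder names merged and token from Step 1:

# copy_tokenizer.py - copy the tokenizer files into the merged model folder
import shutil
from pathlib import Path

src = Path("token")    # created by tokenizer.save_pretrained("token")
dst = Path("merged")   # created by model.save_pretrained("merged")

for f in src.iterdir():
    if f.is_file():
        shutil.copy2(f, dst / f.name)
        print(f"copied {f.name} -> {dst}")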

If you use AutoDL, please copy your merged folder to autodl-fs.

Step 2: Convert the model to GGUF format

The merged model is really large, and every time I want to use it I have to load it with the Hugging Face pretrained loader, which takes a long time.

So we want to pack the model into a single file that is easy to work with. llama.cpp can convert the model into one GGUF file. Run this in your terminal:


# step 1
git clone https://github.com/ggerganov/llama.cpp.git

# step 2
cd llama.cpp

# step 3
pip install -r requirements.txt

# step 4 - if you use CUDA
make LLAMA_CUBLAS=1

# step 4 - if you use CPU only
make

# if you run on Windows, please see this: https://blog.csdn.net/road_of_god/article/details/133901390 or https://zhuanlan.zhihu.com/p/652963043

# step 5 - convert the model to GGUF format
python ./convert.py <your_model_path> --outfile <output_name>.gguf


AutoDL users: if you are tight on disk space, remember to move your previously merged model to autodl-fs or autodl-tmp. In my case, the output GGUF file is 24 GB, which is really large.

Step 3: Quantize to 4-bit

Since the file is still too large, we quantize it to 4-bit:

./quantize <model_output_in_step2>.gguf <output_file_name>.gguf Q4_K

The file shrinks from 24 GB to about 3 GB.
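
At this point the whole model is one small file, so it is also easy to load directly from Python. Here is a minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python); the file name, prompt, and generation settings are just placeholders:

# chat_gguf.py - load the quantized GGUF file directly from Python
from llama_cpp import Llama

llm = Llama(
    model_path="<output_file_name>.gguf",  # the quantized file from Step 3
    n_ctx=2048,                            # context window
    n_gpu_layers=-1,                       # offload all layers to the GPU if one is available
)

out = llm("What is the outlook for tech stocks this quarter?", max_tokens=128, temperature=0.2)
print(out["choices"][0]["text"])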

Launch

./main -m <model_name>.gguf --interactive

# or
./main -m <model_name>.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.3

Server

./server -m <your_model>.gguf -ngl 100

You can then open localhost:8080 in your browser to check it.
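
The server also exposes an HTTP API, so end users (or your own frontend) can query it programmatically. Here is a minimal sketch using Python requests against llama.cpp's /completion endpoint; field names may vary slightly between llama.cpp versions, and the prompt is just a placeholder:

# query_server.py - send a prompt to the llama.cpp server over HTTP
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "What is the outlook for tech stocks this quarter?",
        "n_predict": 128,     # number of tokens to generate
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])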


— Jun 4, 2024
