Recently, I worked on a project to fine-tune an LLM. After fine-tuning, we actually end up with a model in two parts: the Llama-2-7B base model and the LoRA adapter weights.
If we simply merge the two parts together, the result is around 26 GB, which is really large and slow to run. On top of that, every time we want to interact with the model we have to set up token padding and convert the prompt into torch tensors, which is not convenient for an end user. So here we want to deploy our LLM in a way that lets end users interact with it quickly and easily.
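To make the pain point concrete, here is roughly what talking to the un-merged, two-part model looks like through the transformers / PEFT APIs. This is just a sketch: the prompt, the fp16 dtype, and max_new_tokens are illustrative choices, and <your hf-auth> is a placeholder.
# chat_raw.py - interacting with the two-part model by hand
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf', device_map='auto', torch_dtype=torch.float16, token='<your hf-auth>')
model = PeftModel.from_pretrained(base, 'Andy1124233/capstone_fingpt')   # attach the LoRA adapter

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token='<your hf-auth>')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

# every single request needs padding set up and the prompt turned into torch tensors
prompt = "What is the outlook for bank stocks this quarter?"   # illustrative prompt
inputs = tokenizer(prompt, return_tensors='pt', padding=True).to(base.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))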
We first need to merge these two models together. In lora-gpt.py we load the base model and the LoRA adapter, merge them, and save the merged model and tokenizer.
# lora-gpt.py
from torch import cuda
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

output_dir = "/kaggle/working/capstone_fingpt"
hf_auth = "<your hf-auth>"
model_id = 'meta-llama/Llama-2-7b-hf'
peft_model_id = 'Andy1124233/capstone_fingpt'

# load the base Llama-2-7b model; device_map='auto' plus an offload folder
# lets it load even when GPU memory is tight
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map='auto',
    token=hf_auth,
    offload_folder="/kaggle/working/",
)
model.model_parallel = True

# tokenizer: pad on the left and add an explicit [PAD] token if one is missing
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, token=hf_auth)
tokenizer.padding_side = "left"
if not tokenizer.pad_token or tokenizer.pad_token_id == tokenizer.eos_token_id:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
model.config.use_cache = False

# attach the LoRA adapter and fold its weights into the base model
model = PeftModel.from_pretrained(model, peft_model_id, offload_folder="/kaggle/working/")
model = model.eval()
model = model.merge_and_unload()

# save the merged model and the tokenizer
model.save_pretrained("merged")
tokenizer.save_pretrained("token")
If you use the AutoDL platform to load the model, remember to switch the Hugging Face cache to the autodl-tmp directory and also turn on "学术加速" (academic acceleration):
export HF_HOME=/root/autodl-tmp/cache/
source /etc/network_turbo
After running the script, you will have two folders (merged and token). The merged folder contains the safetensors files; move the tokenizer-related files from token into the merged folder.
If you use AutoDL, please copy your merged folder to autodl-fs.
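Before converting anything, it is worth a quick sanity check that the merged folder loads and generates on its own. A minimal sketch, assuming the tokenizer files were copied into merged as described above (the prompt is illustrative):
# check_merged.py - load the merged checkpoint and generate a few tokens
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="merged",               # folder produced by lora-gpt.py, with the tokenizer files copied in
    torch_dtype=torch.float16,
    device_map="auto",
)
print(generator("The stock market today", max_new_tokens=32)[0]["generated_text"])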
Since the merged model is still really large, loading it through the Hugging Face pretrained API every time takes a long time. So we want to pack the model into a single file that is easier to ship and load. llama.cpp is a solution that turns the model into one GGUF file. Run the following in your terminal:
# step 1
git clone https://github.com/ggerganov/llama.cpp.git
# step 2
cd llama.cpp
# step 3
pip install -r requirements.txt
# step 4 - if you use CUDA
make LLAMA_CUBLAS=1
# step 4 - if you use CPU only
make
# if you build on Windows, please see: https://blog.csdn.net/road_of_god/article/details/133901390 or https://zhuanlan.zhihu.com/p/652963043
# step 5 - convert the merged model to GGUF format
python ./convert.py <your_model_path> --outfile <output_name>.gguf
AutoDL users: if you are constrained on disk space, remember to move your previous merged model to autodl-fs or autodl-tmp. In my case, the output GGUF file is 24 GB, which is still really large.
Since the file is too large, we quantize it down to 4-bit:
./quantize <model_output_in_step2>.gguf <output_file_name>.gguf Q4_K
The file shrinks from 24 GB to 3 GB. You can now chat with the quantized model directly from the terminal:
./main -m <model_name>.gguf --interactive
# or, in instruction mode with the Alpaca prompt template, a 2048-token context, temperature 0.2, up to 256 generated tokens, and a repeat penalty:
./main -m <model_name>.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.3
To serve the model over HTTP instead (with -ngl 100 offloading all layers to the GPU):
./server -m <your_model>.gguf -ngl 100
You can then check it on localhost port 8080.
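Once the server is up, end users (or any application) can talk to the model over plain HTTP instead of dealing with tokenizers and tensors. A minimal sketch using Python's requests, assuming the server's default /completion endpoint on port 8080 (the prompt and generation settings are illustrative):
# query_server.py - send a prompt to the llama.cpp server
import requests

payload = {
    "prompt": "What is the outlook for bank stocks this quarter?",   # illustrative prompt
    "n_predict": 128,      # number of tokens to generate
    "temperature": 0.2,
}
resp = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["content"])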