Merge pull request #136 from iMountTai/main
add text-generation-webui instructions
ymcui authored Apr 13, 2023
2 parents 4232d29 + 0b6718a commit c3f58ca
Showing 2 changed files with 69 additions and 4 deletions.
34 changes: 33 additions & 1 deletion README.md
@@ -208,7 +208,7 @@ cd llama.cpp
make
```

#### Step 2: Generate a quantized model

Place the `tokenizer.model` file generated in the last step of [model merging](#合并模型) (choose the `.pth` output format) into the `zh-models` directory, and place the model files `consolidated.*.pth` and the configuration file `params.json` into the `zh-models/7B` directory. Note that the `tokenizer.model` files of LLaMA and Alpaca must not be mixed (see [training details](#训练细节) for the reason). The directory structure should look similar to the following:
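For example, a layout consistent with the files named above might be (a sketch reconstructed from the description; the repository README shows the authoritative tree):

```
zh-models/
   - 7B/
     - consolidated.00.pth
     - params.json
   - tokenizer.model
```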

@@ -255,6 +255,38 @@ python convert-pth-to-ggml.py zh-models/7B/ 1
--top_p, top_k control the decoding sampling parameters
```
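For illustration, an interactive run with the quantized model might look like the sketch below; the flag values here are only a starting point (not the project's recommended settings), and flag availability depends on your llama.cpp build:

```bash
# Interactive (instruct-mode) chat with the 4-bit quantized model.
# -c sets the context size, -n the maximum number of generated tokens;
# --temp/--top_p/--top_k/--repeat_penalty control the sampling behaviour.
./main -m zh-models/7B/ggml-model-q4_0.bin \
    --color -ins -c 2048 -n 512 \
    --temp 0.2 --top_p 0.9 --top_k 40 --repeat_penalty 1.1
```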

### text-generation-webui

Next, taking the [text-generation-webui tool](https://github.com/oobabooga/text-generation-webui) as an example, we describe the detailed steps for **local deployment** without merging the model weights.

```bash
# Clone text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

# Place the downloaded LoRA weights in the loras folder
ls loras/chinese-alpaca-lora-7b
adapter_config.json adapter_model.bin special_tokens_map.json tokenizer_config.json tokenizer.model

# Place the HuggingFace-format llama-7B model files in the models folder
ls models/llama-7b-hf
pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin config.json pytorch_model.bin.index.json generation_config.json

# Copy the tokenizer files from the LoRA weights into models/llama-7b-hf
cp loras/chinese-alpaca-lora-7b/tokenizer.model models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/special_tokens_map.json models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/tokenizer_config.json models/llama-7b-hf/

# Modify modules/LoRA.py (around line 28): add the resize_token_embeddings line before the existing PeftModel.from_pretrained call
shared.model.resize_token_embeddings(len(shared.tokenizer))
shared.model = PeftModel.from_pretrained(shared.model, Path(f"{shared.args.lora_dir}/{lora_name}"), **params)

# You can now run the web UI; see https://github.com/oobabooga/text-generation-webui/wiki/Using-LoRAs for details on using LoRAs
python server.py --model llama-7b-hf --lora chinese-alpaca-lora-7b

```
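The two Python statements in the block above are the lines that should end up in `modules/LoRA.py`. A sketch of how the relevant region might look after the edit (abbreviated; the function and variable names follow the text-generation-webui layout at the time and may differ in other versions):

```python
# modules/LoRA.py (excerpt, around line 28) -- abbreviated sketch
from pathlib import Path

from peft import PeftModel

import modules.shared as shared


def add_lora_to_model(lora_name):
    params = {}  # in the original file this dict carries device/dtype options

    # Added line: the Chinese-Alpaca tokenizer contains extra tokens, so the
    # base llama-7b embedding matrix must be resized before attaching the LoRA.
    shared.model.resize_token_embeddings(len(shared.tokenizer))

    # Existing line: attach the LoRA weights.
    shared.model = PeftModel.from_pretrained(
        shared.model, Path(f"{shared.args.lora_dir}/{lora_name}"), **params
    )
```

With this change in place, loading `llama-7b-hf` together with `chinese-alpaca-lora-7b` should no longer fail with a size mismatch between the embedding matrix and the extended tokenizer.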

### Inference with Transformers

If you want to quickly try out the model without installing additional libraries or Python packages, you can use [scripts/inference_hf.py](scripts/inference_hf.py) to launch the model without quantization. The script supports single-device inference on both CPU and GPU. Taking the Chinese-Alpaca 7B model as an example, the script is run as follows:
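A plausible invocation is sketched below; the flag names (`--base_model`, `--lora_model`, `--with_prompt`, `--interactive`) are assumptions about the script's interface, so check `scripts/inference_hf.py` itself for the authoritative arguments:

```bash
# Hypothetical example: apply the Chinese-Alpaca LoRA on top of the base model
# and chat interactively, wrapping inputs in the Alpaca prompt template.
python scripts/inference_hf.py \
    --base_model path/to/llama-7b-hf \
    --lora_model path/to/chinese-alpaca-lora-7b \
    --with_prompt \
    --interactive
```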
39 changes: 36 additions & 3 deletions README_EN.md
@@ -194,6 +194,7 @@ where:
*(Optional) If necessary, you can convert the `.pth` files generated in this step to HuggingFace format using the script in Step 1.*
## Quick Deployment
### llama.cpp
The research community has developed many excellent model quantization and deployment tools to help users **easily deploy large models locally on their own computers (CPU!)**. In the following, we'll take the [llama.cpp tool](https://github.com/ggerganov/llama.cpp) as an example and introduce the detailed steps to quantize and deploy the model on MacOS and Linux systems. For Windows, you may need to install build tools like cmake. **For a local quick deployment experience, it is recommended to use the instruction-finetuned Alpaca model.**

@@ -204,7 +205,7 @@ Before running, please ensure:
3. The system should have `make` (built-in for MacOS/Linux) or `cmake` (need to be installed separately for Windows) build tools.
4. It is recommended to use Python 3.9 or 3.10 to build and run the [llama.cpp tool](https://github.com/ggerganov/llama.cpp) (since `sentencepiece` does not yet support 3.11).

#### Step 1: Clone and build llama.cpp

Run the following commands to build the llama.cpp project, generating `./main` and `./quantize` binary files.

@@ -214,7 +215,7 @@ cd llama.cpp
make
```

#### Step 2: Generate a quantized model

Depending on the type of model you want to convert (LLaMA or Alpaca), place the `tokenizer.*` files from the downloaded LoRA model package into the `zh-models` directory, and place the `params.json` file and the `consolidated.*.pth` model file obtained in the last step of [Model Reconstruction](#Model-Reconstruction) into the `zh-models/7B` directory. Note that each `.pth` model file corresponds to a specific `tokenizer.model`, and the `tokenizer.model` files for LLaMA and Alpaca must not be mixed. The directory structure should be similar to:

@@ -238,7 +239,7 @@ Further quantize the FP16 model to 4-bit, and generate a quantized model file with suffix `q4_0`:
./quantize ./zh-models/7B/ggml-model-f16.bin ./zh-models/7B/ggml-model-q4_0.bin 2
```
#### Step 3: Load and start the model
Run the `./main` binary, using the `-m` flag to specify the 4-bit quantized model (or the ggml-FP16 model) to load. Below is an example of the decoding parameters:
@@ -257,6 +258,38 @@ Please enter your prompt after the `>`, use `\` as the end of the line for multi-line input.
--top_p, top_k control the sampling parameters
```
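For a one-shot (non-interactive) generation, the prompt can also be passed directly on the command line; this is a sketch rather than the project's documented invocation, and the exact flags depend on your llama.cpp build:

```bash
# Generate a single response for a fixed prompt with the 4-bit quantized model.
./main -m zh-models/7B/ggml-model-q4_0.bin \
    -p "请列举北京的几个著名景点。" \
    -n 256 --temp 0.5 --top_p 0.9 --top_k 40 --repeat_penalty 1.1
```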
### text-generation-webui
Next, we will use the [text-generation-webui tool](https://github.com/oobabooga/text-generation-webui) as an example to introduce the detailed steps for local deployment without the need for model merging.
```bash
# clone text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
# put the downloaded lora weights into the loras folder.
ls loras/chinese-alpaca-lora-7b
adapter_config.json adapter_model.bin special_tokens_map.json tokenizer_config.json tokenizer.model
# put the HuggingFace-formatted llama-7B model files into the models folder.
ls models/llama-7b-hf
pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin config.json pytorch_model.bin.index.json generation_config.json
# copy the tokenizer of lora weights to the models/llama-7b-hf directory
cp loras/chinese-alpaca-lora-7b/tokenizer.model models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/special_tokens_map.json models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/tokenizer_config.json models/llama-7b-hf/
# Modify modules/LoRA.py (around line 28): add the resize_token_embeddings line before the existing PeftModel.from_pretrained call
shared.model.resize_token_embeddings(len(shared.tokenizer))
shared.model = PeftModel.from_pretrained(shared.model, Path(f"{shared.args.lora_dir}/{lora_name}"), **params)
# Great! You can now run the tool. Please refer to https://github.com/oobabooga/text-generation-webui/wiki/Using-LoRAs for instructions on how to use LoRAs
python server.py --model llama-7b-hf --lora chinese-alpaca-lora-7b
```
## System Performance
To quickly evaluate the actual performance of the related models, this project compares the outputs of Chinese-Alpaca-7B and Chinese-Alpaca-13B on some common tasks given the same prompts. The tested models are all **4-bit quantized models**, so their quality is expected to be somewhat worse than that of the non-quantized versions. Response generation is random and is affected by decoding hyperparameters, random seeds, and other factors, so the evaluations below are not absolutely rigorous and the results are for reference only; you are welcome to try the models yourself. For detailed evaluation results, please see [examples/README.md](./examples/README.md).
