Merge pull request #136 from iMountTai/main
add text-generation-webui instructions
ymcui authored Apr 13, 2023
2 parents 4232d29 + 0b6718a commit c3f58ca
Showing 2 changed files with 69 additions and 4 deletions.
34 changes: 33 additions & 1 deletion README.md
@@ -208,7 +208,7 @@ cd llama.cpp
make
```

#### Step 2: Generate a quantized model

Place the `tokenizer.model` file generated in the last step of [model merging](#合并模型) (choose the `.pth` output format) into the `zh-models` directory, and place the model files `consolidated.*.pth` and the configuration file `params.json` into the `zh-models/7B` directory. Note that the `tokenizer.model` files of LLaMA and Alpaca must not be mixed (see [training details](#训练细节) for the reason). The directory structure should look similar to the following:
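For example, a layout consistent with the files named above might be (a sketch reconstructed from the description; the repository README shows the authoritative tree):

```
zh-models/
   - 7B/
     - consolidated.00.pth
     - params.json
   - tokenizer.model
```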

@@ -255,6 +255,38 @@ python convert-pth-to-ggml.py zh-models/7B/ 1
--top_p, top_k control the decoding sampling parameters
```
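For illustration, an interactive run with the quantized model might look like the sketch below; the flag values here are only a starting point (not the project's recommended settings), and flag availability depends on your llama.cpp build:

```bash
# Interactive (instruct-mode) chat with the 4-bit quantized model.
# -c sets the context size, -n the maximum number of generated tokens;
# --temp/--top_p/--top_k/--repeat_penalty control the sampling behaviour.
./main -m zh-models/7B/ggml-model-q4_0.bin \
    --color -ins -c 2048 -n 512 \
    --temp 0.2 --top_p 0.9 --top_k 40 --repeat_penalty 1.1
```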

### text-generation-webui

Next, taking the [text-generation-webui tool](https://github.com/oobabooga/text-generation-webui) as an example, we describe the detailed steps for **local deployment** without merging the model weights.

```bash
# Clone text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

# Place the downloaded LoRA weights in the loras folder
ls loras/chinese-alpaca-lora-7b
adapter_config.json adapter_model.bin special_tokens_map.json tokenizer_config.json tokenizer.model

# Place the HuggingFace-format llama-7B model files in the models folder
ls models/llama-7b-hf
pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin config.json pytorch_model.bin.index.json generation_config.json

# Copy the tokenizer files from the LoRA weights into models/llama-7b-hf
cp loras/chinese-alpaca-lora-7b/tokenizer.model models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/special_tokens_map.json models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/tokenizer_config.json models/llama-7b-hf/

# Modify modules/LoRA.py (around line 28): add the resize_token_embeddings line before the existing PeftModel.from_pretrained call
shared.model.resize_token_embeddings(len(shared.tokenizer))
shared.model = PeftModel.from_pretrained(shared.model, Path(f"{shared.args.lora_dir}/{lora_name}"), **params)

# You can now run the web UI; see https://github.com/oobabooga/text-generation-webui/wiki/Using-LoRAs for details on using LoRAs
python server.py --model llama-7b-hf --lora chinese-alpaca-lora-7b

```
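The two Python statements in the block above are the lines that should end up in `modules/LoRA.py`. A sketch of how the relevant region might look after the edit (abbreviated; the function and variable names follow the text-generation-webui layout at the time and may differ in other versions):

```python
# modules/LoRA.py (excerpt, around line 28) -- abbreviated sketch
from pathlib import Path

from peft import PeftModel

import modules.shared as shared


def add_lora_to_model(lora_name):
    params = {}  # in the original file this dict carries device/dtype options

    # Added line: the Chinese-Alpaca tokenizer contains extra tokens, so the
    # base llama-7b embedding matrix must be resized before attaching the LoRA.
    shared.model.resize_token_embeddings(len(shared.tokenizer))

    # Existing line: attach the LoRA weights.
    shared.model = PeftModel.from_pretrained(
        shared.model, Path(f"{shared.args.lora_dir}/{lora_name}"), **params
    )
```

With this change in place, loading `llama-7b-hf` together with `chinese-alpaca-lora-7b` should no longer fail with a size mismatch between the embedding matrix and the extended tokenizer.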

### Inference with Transformers

If you want to quickly try out the model without installing additional libraries or Python packages, you can use [scripts/inference_hf.py](scripts/inference_hf.py) to launch the model without quantization. The script supports single-device inference on both CPU and GPU. Taking the Chinese-Alpaca 7B model as an example, the script is run as follows:
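A plausible invocation is sketched below; the flag names (`--base_model`, `--lora_model`, `--with_prompt`, `--interactive`) are assumptions about the script's interface, so check `scripts/inference_hf.py` itself for the authoritative arguments:

```bash
# Hypothetical example: apply the Chinese-Alpaca LoRA on top of the base model
# and chat interactively, wrapping inputs in the Alpaca prompt template.
python scripts/inference_hf.py \
    --base_model path/to/llama-7b-hf \
    --lora_model path/to/chinese-alpaca-lora-7b \
    --with_prompt \
    --interactive
```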
39 changes: 36 additions & 3 deletions README_EN.md
@@ -194,6 +194,7 @@ where:
*(Optional) If necessary, you can convert the `.pth` files generated in this step to HuggingFace format using the script in Step 1.*
## Quick Deployment
### llama.cpp
The research community has developed many excellent model quantization and deployment tools to help users **easily deploy large models locally on their own computers (CPU!)**. In the following, we'll take the [llama.cpp tool](https://github.com/ggerganov/llama.cpp) as an example and introduce the detailed steps to quantize and deploy the model on MacOS and Linux systems. For Windows, you may need to install build tools like cmake. **For a local quick deployment experience, it is recommended to use the instruction-finetuned Alpaca model.**

@@ -204,7 +205,7 @@ Before running, please ensure:
3. The system should have `make` (built-in for MacOS/Linux) or `cmake` (need to be installed separately for Windows) build tools.
4. It is recommended to use Python 3.9 or 3.10 to build and run the [llama.cpp tool](https://github.com/ggerganov/llama.cpp) (since `sentencepiece` does not yet support 3.11).

#### Step 1: Clone and build llama.cpp

Run the following commands to build the llama.cpp project, generating `./main` and `./quantize` binary files.

@@ -214,7 +215,7 @@ cd llama.cpp
make
```

#### Step 2: Generate a quantized model

Depending on the type of model you want to convert (LLaMA or Alpaca), place the `tokenizer.*` files from the downloaded LoRA model package into the `zh-models` directory, and place the `params.json` file and the `consolidated.*.pth` model file obtained in the last step of [Model Reconstruction](#Model-Reconstruction) into the `zh-models/7B` directory. Note that each `.pth` model file corresponds to a specific `tokenizer.model`, and the `tokenizer.model` files for LLaMA and Alpaca must not be mixed. The directory structure should be similar to:

@@ -238,7 +239,7 @@ Further quantize the FP16 model to 4-bit, and generate a quantized model file with suffix `q4_0`:
./quantize ./zh-models/7B/ggml-model-f16.bin ./zh-models/7B/ggml-model-q4_0.bin 2
```
#### Step 3: Load and start the model
Run the `./main` binary, using the `-m` flag to specify the 4-bit quantized model (or the ggml-FP16 model) to load. Below is an example of the decoding parameters:
@@ -257,6 +258,38 @@ Please enter your prompt after the `>`, use `\` as the end of the line for multi-line input.
--top_p, top_k control the sampling parameters
```
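For a one-shot (non-interactive) generation, the prompt can also be passed directly on the command line; this is a sketch rather than the project's documented invocation, and the exact flags depend on your llama.cpp build:

```bash
# Generate a single response for a fixed prompt with the 4-bit quantized model.
./main -m zh-models/7B/ggml-model-q4_0.bin \
    -p "请列举北京的几个著名景点。" \
    -n 256 --temp 0.5 --top_p 0.9 --top_k 40 --repeat_penalty 1.1
```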
### text-generation-webui
Next, we will use the [text-generation-webui tool](https://github.com/oobabooga/text-generation-webui) as an example to introduce the detailed steps for local deployment without the need for model merging.
```bash
# clone text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
# put the downloaded lora weights into the loras folder.
ls loras/chinese-alpaca-lora-7b
adapter_config.json adapter_model.bin special_tokens_map.json tokenizer_config.json tokenizer.model
# put the HuggingFace-formatted llama-7B model files into the models folder.
ls models/llama-7b-hf
pytorch_model-00001-of-00002.bin pytorch_model-00002-of-00002.bin config.json pytorch_model.bin.index.json generation_config.json
# copy the tokenizer of lora weights to the models/llama-7b-hf directory
cp loras/chinese-alpaca-lora-7b/tokenizer.model models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/special_tokens_map.json models/llama-7b-hf/
cp loras/chinese-alpaca-lora-7b/tokenizer_config.json models/llama-7b-hf/
# Modify modules/LoRA.py (around line 28): add the resize_token_embeddings line before the existing PeftModel.from_pretrained call
shared.model.resize_token_embeddings(len(shared.tokenizer))
shared.model = PeftModel.from_pretrained(shared.model, Path(f"{shared.args.lora_dir}/{lora_name}"), **params)
# Great! You can now run the tool. Please refer to https://github.com/oobabooga/text-generation-webui/wiki/Using-LoRAs for instructions on how to use LoRAs
python server.py --model llama-7b-hf --lora chinese-alpaca-lora-7b
```
## System Performance
To quickly evaluate the actual performance of the related models, this project compares the outputs of Chinese-Alpaca-7B and Chinese-Alpaca-13B on some common tasks given the same prompts. The tested models are all **4-bit quantized models**, so their quality is expected to be somewhat worse than that of the non-quantized versions. Response generation is random and is affected by decoding hyperparameters, random seeds, and other factors, so the evaluations below are not absolutely rigorous and the results are for reference only; you are welcome to try the models yourself. For detailed evaluation results, please see [examples/README.md](./examples/README.md).
