From 14098d49dd89fd6643bdfbbbd8041d165f914331 Mon Sep 17 00:00:00 2001
From: shibing624
Date: Fri, 26 Jan 2024 18:16:00 +0800
Subject: [PATCH] update moe

---
 README.md               | 15 +++++++++------
 README_EN.md            | 12 ++++++++++++
 docs/training_params.md |  1 +
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 2eb22a0..6a6c94b 100644
--- a/README.md
+++ b/README.md
@@ -30,6 +30,8 @@ Supervised Finetuning, RLHF(Reward Modeling and Reinforcement Learning) and DPO(
 - The DPO method comes from the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/pdf/2305.18290.pdf)
 
 ## 🔥 News
+[2024/01/26] v1.8: Added support for fine-tuning the Mixtral mixture-of-experts (MoE) model **[Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)**. See [Release-v1.8](https://github.com/shibing624/MedicalGPT/releases/tag/1.8.0) for details
+
 [2024/01/14] v1.7: Added a retrieval-augmented generation (RAG) file-based question-answering feature, [ChatPDF](https://github.com/shibing624/ChatPDF) (code: `chatpdf.py`), which combines a fine-tuned LLM with knowledge-base files to improve domain QA accuracy. See [Release-v1.7](https://github.com/shibing624/MedicalGPT/releases/tag/1.7.0) for details
 
 [2023/10/23] v1.6: Added RoPE interpolation to extend the context length of GPT models; added support for [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) and the **$S^2$-Attn** proposed by [LongLoRA](https://github.com/dvlab-research/LongLoRA) for LLaMA models; added the [NEFTune](https://github.com/neelsjain/NEFTune) noisy-embedding training method. See [Release-v1.6](https://github.com/shibing624/MedicalGPT/releases/tag/1.6.0) for details
@@ -110,12 +112,13 @@ pip install -r requirements.txt --upgrade
 
 #### Hardware Requirement
 
-| Method | Bits | 7B    | 13B   | 30B   | 65B    |
-| ------ | ---- | ----- | ----- | ----- | ------ |
-| Full   | 16   | 160GB | 320GB | 600GB | 1200GB |
-| LoRA   | 16   | 16GB  | 32GB  | 80GB  | 160GB  |
-| QLoRA  | 8    | 10GB  | 16GB  | 40GB  | 80GB   |
-| QLoRA  | 4    | 6GB   | 12GB  | 24GB  | 48GB   |
+
+| Training Method | Bits | 7B    | 13B   | 30B   | 65B    | 8x7B  |
+| --------------- | ---- | ----- | ----- | ----- | ------ | ----- |
+| Full            | 16   | 160GB | 320GB | 600GB | 1200GB | 900GB |
+| LoRA            | 16   | 16GB  | 32GB  | 80GB  | 160GB  | 120GB |
+| QLoRA           | 8    | 10GB  | 16GB  | 40GB  | 80GB   | 80GB  |
+| QLoRA           | 4    | 6GB   | 12GB  | 24GB  | 48GB   | 32GB  |
 
 
 ## 🚀 Training Pipeline
diff --git a/README_EN.md b/README_EN.md
index 789c68e..e82c2e1 100644
--- a/README_EN.md
+++ b/README_EN.md
@@ -50,6 +50,8 @@ Parameter Description:
 
 - `--gpus {gpu_ids}`: Specifies the GPU devices to use; the default is 0. If using multiple GPUs, separate the ids with commas, such as 0,1,2
 
+
+
 ## 🚀 Training Pipeline
 
 ### Stage 1: Continue Pretraining
@@ -114,6 +116,16 @@ sh run_ppo.sh
 ```
 
 [Training Detail wiki](https://github.com/shibing624/MedicalGPT/wiki/Training-Details)
+
+### Hardware Requirement
+
+| Method | Bits | 7B    | 13B   | 30B   | 65B    | 8x7B  |
+| ------ | ---- | ----- | ----- | ----- | ------ | ----- |
+| Full   | 16   | 160GB | 320GB | 600GB | 1200GB | 900GB |
+| LoRA   | 16   | 16GB  | 32GB  | 80GB  | 160GB  | 120GB |
+| QLoRA  | 8    | 10GB  | 16GB  | 40GB  | 80GB   | 80GB  |
+| QLoRA  | 4    | 6GB   | 12GB  | 24GB  | 48GB   | 32GB  |
+
 ## 🔥 Inference
 
 After the training is complete, now we load the trained model to verify the effect of the model generating text.
diff --git a/docs/training_params.md b/docs/training_params.md
index 23e6a37..ab1271d 100644
--- a/docs/training_params.md
+++ b/docs/training_params.md
@@ -26,6 +26,7 @@
 12. Added [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) support for LLaMA models; if you are using an RTX 4090, A100, or H100 GPU, pass the `--flash_attn` flag during SFT to enable FlashAttention-2
 13. Added the **$S^2$-Attn** proposed by [LongLoRA](https://github.com/dvlab-research/LongLoRA), which gives the model long-context capability; pass the `--shift_attn` flag during SFT to enable it
 14. Added the [NEFTune](https://github.com/neelsjain/NEFTune) noisy-embedding SFT training method ([NEFTune paper](https://arxiv.org/abs/2310.05914)); pass the `--neft_alpha` flag during SFT to enable NEFTune, e.g. `--neft_alpha 5`
+15. Added support for fine-tuning the Mixtral mixture-of-experts (MoE) model **[Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)**. When fine-tuning with LoRA during SFT, you can enable 4-bit quantization and QLoRA with `--load_in_4bit True --qlora True` to save GPU memory; setting `--target_modules q_proj,k_proj,v_proj,o_proj` is recommended so that the MLP layers of the MoE expert networks are not quantized, since they are sparse and quantizing them degrades performance (a minimal sketch of this setup follows after the patch).
 
 **About PT Training**
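To make the Mixtral recommendation in item 15 above concrete, here is a minimal, hedged Python sketch of the corresponding QLoRA configuration at the transformers/peft level: the base model is loaded in 4-bit and LoRA adapters are attached only to the attention projections (`q_proj,k_proj,v_proj,o_proj`), so the sparse MoE expert MLPs carry no adapters. This is not the repository's own SFT script; the library calls, the LoRA hyperparameter values, and the commented-out skip-module names are assumptions made for illustration only.

```python
# Minimal QLoRA sketch for Mixtral 8x7B (illustration only, not the repo's SFT script).
# Assumptions: transformers, peft, and bitsandbytes are installed, a CUDA GPU is available,
# and the LoRA hyperparameters below are placeholder values, not taken from the patch.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mixtral-8x7B-v0.1"

# 4-bit NF4 quantization of the base model (the role of --load_in_4bit True --qlora True).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    # Assumption: to keep the sparse expert MLPs and router out of quantization entirely,
    # their module names could be listed here, e.g.:
    # llm_int8_skip_modules=["w1", "w2", "w3", "gate"],
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepare the quantized model for training

# LoRA adapters only on the attention projections, mirroring
# --target_modules q_proj,k_proj,v_proj,o_proj; the MoE expert MLPs get no adapters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```

In the repository itself the same intent is expressed through the command-line flags quoted in item 15; the sketch only indicates roughly what that configuration maps to.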