Dynamically routes incoming requests to the most appropriate LLM based on query complexity, optimizing response quality (by prompting a model that is just capable enough) and saving costs (by avoiding inference on larger models when a smaller one suffices). Routing can be guided by a dataset curated for your own use case, and two routing strategies are provided:
- Embedding Based Router - Deployed on AWS
- Classification Based Router
A comparison of both strategies is given at the end of this section.
Also, check out the 🤗HF-Collection.
You can curate your own dataset for your use case. However, a ~15K-sample dataset has been uploaded to 🤗HuggingFace, mapping each query to the appropriate model (3B, 7B, 30B, or 70B).
It is based on the KNN router by PulzeAI, a Go server that generates a ranked list of targets for a query based on its K-Nearest Neighbors. Additionally, we fine-tune our own embedding model, based on BAAI/bge-base-en-v1.5, on the above-mentioned routing dataset. After generating the deployment artifacts as described below, the routing server is deployed on an AWS EC2 instance.
The setup procedure for the embedding-based router follows.
We fine-tune our own router-embedding model, based on BAAI/bge-base-en-v1.5, on the above-mentioned routing dataset. The fine-tuning procedure is shown in `Embed_FineTune.ipynb`, where we leverage SentenceTransformers to train the embedding model with the `BatchAllTripletLoss` loss function. Training progress is logged to WandB.
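A minimal sketch of that fine-tuning setup, assuming the routing dataset exposes `query` and `model` (label) columns; the notebook may use different hyperparameters:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Map each target model size to an integer label for the triplet loss.
label2id = {"3B": 0, "7B": 1, "30B": 2, "70B": 3}

# Toy records standing in for the real ~15K-sample routing dataset.
rows = [
    {"query": "What is 2 + 2?", "model": "3B"},
    {"query": "Prove the Cauchy-Schwarz inequality.", "model": "70B"},
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

train_examples = [
    InputExample(texts=[row["query"]], label=label2id[row["model"]])
    for row in rows
]

# BatchAllTripletLoss mines all valid anchor/positive/negative triplets
# within each batch, so no explicit triplet construction is needed.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.BatchAllTripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="router-embedding",
)
```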
To generate deployment artifacts, we need the following dependencies:
- `points.jsonl`: JSONL-formatted file containing points and their respective categories and embeddings. Each line should contain the following fields: `point_uid`, `category`, and `embedding`.
- `targets.jsonl`: JSONL-formatted file containing the targets and their respective scores for each point. Each line should contain the following fields: `point_uid`, `target`, and `score` (example lines for both files are shown below).
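For illustration, a line of `points.jsonl` might look like this (hypothetical values; the embedding is truncated to three numbers here, while BAAI/bge-base-en-v1.5 actually produces 768-dimensional vectors):

```
{"point_uid": "b1c2d3e4-0000-0000-0000-000000000000", "category": "3B", "embedding": [0.0123, -0.0456, 0.0789]}
```

and a corresponding line of `targets.jsonl`:

```
{"point_uid": "b1c2d3e4-0000-0000-0000-000000000000", "target": "3B", "score": 0.9}
```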
The `PointsAndTargets.ipynb` notebook can generate these files from our fine-tuned embedding model and the routing dataset; feel free to adapt its functions if your dataset labels differ. After generating the files, push them to the same embedding-model repository on the Hub.
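A simplified sketch of what this generation could look like, again with toy records in place of the real dataset (the notebook may distribute scores over several targets rather than assigning a single 1.0):

```python
import json
import uuid

from sentence_transformers import SentenceTransformer

# Our fine-tuned router-embedding model, saved locally by the fine-tuning step.
model = SentenceTransformer("router-embedding")

# Toy records standing in for the real routing dataset.
rows = [
    {"query": "What is 2 + 2?", "model": "3B"},
    {"query": "Prove the Cauchy-Schwarz inequality.", "model": "70B"},
]

with open("points.jsonl", "w") as points_f, open("targets.jsonl", "w") as targets_f:
    for row in rows:
        uid = str(uuid.uuid4())
        embedding = model.encode(row["query"]).tolist()
        points_f.write(json.dumps(
            {"point_uid": uid, "category": row["model"], "embedding": embedding}) + "\n")
        # Simplification: give each point's own label the full score.
        targets_f.write(json.dumps(
            {"point_uid": uid, "target": row["model"], "score": 1.0}) + "\n")
```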
Navigate to the AWS EC2 Dashboard and launch an EC2 instance with Ubuntu and a 30 GB disk volume (to avoid running out of disk space or the instance freezing up); the remaining default configurations are all suitable under the free tier. Let's set up all the dependencies.
Now set up a Python virtual environment.
sudo apt update
sudo apt install python3.12-venv
python3 -m venv .venv
source .venv/bin/activate
Let's install and set up Docker on the EC2.
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc
do
sudo apt-get remove -y "$pkg"
done
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
Also, install Go on the instance.
sudo apt install golang-go
go version
Clone the KNN-router repository.
git clone https://github.com/pulzeai-oss/knn-router.git
cd knn-router/deploy/pulze-intent-v0.1
Install a few more dependencies and authenticate your HF account with `--token`:
pip install transformers huggingface_hub
huggingface-cli login --token ''
Install and initialize Git LFS, then clone your embedding-model repository from HF, which also contains the points and targets dependencies.
sudo apt-get update && sudo apt-get install git-lfs && git lfs install
git clone https://huggingface.co/Muhammad2003/router-embedding
Now you can generate the artifacts by providing the `--points-data-path` and `--scores-data-path` from the cloned repository. This produces the `embeddings.snapshot` and `scores.db` artifacts, which you can push back to the HF repository to complete it for deployment, and then delete the local clone.
../../scripts/gen-artifacts.sh --points-data-path ./router-embedding/points.jsonl --scores-data-path ./router-embedding/targets.jsonl --output-dir .
huggingface-cli upload Muhammad2003/router-embedding ./embeddings.snapshot embeddings.snapshot
huggingface-cli upload Muhammad2003/router-embedding ./scores.db scores.db
sudo rm -r ./router-embedding
Download the finalized HF repo.
huggingface-cli download Muhammad2003/router-embedding --local-dir .dist --local-dir-use-symlinks=False
Edit the `docker-compose.yml` file and finally start the server:
sed -i 's|--model-id=/srv/run/embedding-model|--model-id=/srv/run/|' docker-compose.yml
sudo docker compose up -d --build
sudo docker ps -a
You can get routing output with a curl request:
curl -s 127.0.0.1:8888/ \
-X POST \
-d '{"query":"How does pineapple on pizza sound?"}' \
-H 'Content-Type: application/json' | jq .
RESPONSE
{
"hits": [
{
"id": "801917cd-12de-4dfa-a18a-a8ef51681741",
"category": "3B",
"similarity": 0.99916637
},
{
"id": "32def154-7906-4c25-a17a-8536f38b6e43",
"category": "30B",
"similarity": 0.9991118
},
{
"id": "b724d01a-3041-40e3-8339-938aada6e9f1",
"category": "3B",
"similarity": 0.99910575
},
{
"id": "1a08d6c4-333e-423f-a9ca-0a50fb1115b4",
"category": "3B",
"similarity": 0.99910486
},
{
"id": "1657366b-358d-4e7e-8390-50579500fa1c",
"category": "3B",
"similarity": 0.9991038
},
{
"id": "d6b85ae6-c82f-4a5f-a294-62b00bf65710",
"category": "3B",
"similarity": 0.9990984
},
{
"id": "0052e853-a9f7-4d87-bf7d-deed3a43e23b",
"category": "3B",
"similarity": 0.9990958
},
{
"id": "45bb68b0-bd9d-42b0-8510-dac9d615c8e8",
"category": "3B",
"similarity": 0.9990936
},
{
"id": "bc87a3ea-c4a5-4587-a0e5-9dff42632c48",
"category": "3B",
"similarity": 0.99909335
},
{
"id": "b7af3297-4b3b-4430-8360-e2f95d144727",
"category": "3B",
"similarity": 0.9990907
}
],
"scores": [
{
"target": "3B",
"score": 0.9
},
{
"target": "30B",
"score": 0.1
}
]
}
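In an application, a thin client can post the query and route to whichever target wins the aggregated `scores`; a minimal sketch using only the Python standard library (mapping the returned category to your actual model endpoints is left to you):

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8888/"

def route(query: str) -> str:
    """Ask the KNN router which model category best fits a query."""
    request = urllib.request.Request(
        ROUTER_URL,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # `scores` aggregates the per-neighbor hits; pick the top-scoring target.
    best = max(body["scores"], key=lambda s: s["score"])
    return best["target"]

print(route("How does pineapple on pizza sound?"))  # e.g. "3B"
```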
A simpler but less efficient alternative is to train a text classifier on the same data to output the correct label/class for the appropriate model given an input query. `TinyLlama_RouterClassifier.ipynb` walks through fine-tuning TinyLlama/TinyLlama-1.1B-Chat-v0.6 on the routing dataset for this classification task.
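A minimal sketch of such a classifier, assuming the same `query`/`model` columns as before (the notebook's actual training arguments may differ):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

label2id = {"3B": 0, "7B": 1, "30B": 2, "70B": 3}
id2label = {v: k for k, v in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v0.6")
model = AutoModelForSequenceClassification.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v0.6",
    num_labels=len(label2id), label2id=label2id, id2label=id2label,
)

# Llama-family tokenizers ship without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Toy records standing in for the real routing dataset.
dataset = Dataset.from_list([
    {"query": "What is 2 + 2?", "model": "3B"},
    {"query": "Prove the Cauchy-Schwarz inequality.", "model": "70B"},
])

def tokenize(batch):
    encoded = tokenizer(batch["query"], truncation=True, max_length=256)
    encoded["labels"] = [label2id[m] for m in batch["model"]]
    return encoded

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tinyllama-router",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```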
Overall, classification routers can produce strong outputs but incur greater inference and storage costs and higher latency, while embedding routers are cost-effective and respond faster, making them better suited to large-scale systems.
| Aspect | TinyLlama Classifier | Embedding KNN Router |
|---|---|---|
| Inference Cost | High (4-5 GB model size) | Low (~500 MB model size) |
| Resource Requirements | Significant computational power and storage | Minimal computational power and storage |
| GPU Requirement | Requires GPUs | Does not require GPUs |
| Accuracy | High; capable of handling complex tasks | Adequate for most routing tasks |
| Latency | High latency, slower response times | Low latency, faster response times |
| Performance | High accuracy, detailed training | Almost the same as the TinyLlama classifier in practical scenarios |
| Scalability | Challenging due to high resource demands and costs | Easily scalable, suitable for rapid scaling |