Simple FastAPI service for the LLaMA-2 7B chat model.
The current version supports only the 7B-chat model.
Tested on a single NVIDIA L4 GPU (24 GB) on GCP (machine type g2-standard-8).
Run:
poetry install
Download the llama-2-7b-chat model according to the instructions in the llama repository.
export RANK="0"
export WORLD_SIZE="1"
export MASTER_ADDR="0.0.0.0"
export MASTER_PORT="2137"
export NCCL_P2P_DISABLE="1"
export OMP_NUM_THREADS=4 # optional
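The RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT variables are the ones torch.distributed's env:// rendezvous reads (the llama reference loader initializes a process group this way). A minimal sketch of how they are typically consumed, assuming a single-GPU, single-process run; the defaults here are illustrative, not taken from this repo:

```python
import os

# Read the rendezvous settings that the exports above provide.
# Defaults mirror the single-process values used in this README.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
master_addr = os.environ.get("MASTER_ADDR", "0.0.0.0")
master_port = int(os.environ.get("MASTER_PORT", "2137"))

# With WORLD_SIZE=1 there is exactly one process, so its rank must be 0.
assert 0 <= rank < world_size
```

NCCL_P2P_DISABLE=1 turns off peer-to-peer GPU transfers, which is harmless on a single-GPU machine, and OMP_NUM_THREADS caps CPU thread usage.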
Run the following command:
python laas/main.py
To explore the endpoints, open http://0.0.0.0:8080/docs
To run the fast tests (no LLM loaded):
pytest
To also run the slow tests:
pytest --runslow
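The --runslow flag is the standard custom-option pattern from the pytest documentation. A sketch of the conftest.py hooks that implement it, assuming this repo follows that convention (an assumption, not verified against the code):

```python
import pytest

def pytest_addoption(parser):
    # Register the custom command-line flag.
    parser.addoption("--runslow", action="store_true", default=False,
                     help="run tests marked as slow (e.g. tests that load the LLM)")

def pytest_configure(config):
    # Register the marker so pytest does not warn about it.
    config.addinivalue_line("markers", "slow: mark test as slow to run")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: do not skip anything
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

Tests that load the model would then be decorated with @pytest.mark.slow, so plain pytest skips them and pytest --runslow includes them.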