Simple FastAPI service for the LLaMA-2 7B chat model.
The current version supports only the 7B-chat model.
Tested on a single NVIDIA L4 GPU (24 GB) on GCP (machine type g2-standard-8).
Run:
poetry install
Download the llama-2-7b-chat model according to the instructions in the llama repository.
export RANK="0"
export WORLD_SIZE="1"
export MASTER_ADDR="0.0.0.0"
export MASTER_PORT="2137"
export NCCL_P2P_DISABLE="1"
export OMP_NUM_THREADS=4 # optional
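The RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT variables are the ones torch.distributed's env:// rendezvous reads (the llama reference loader initializes a process group this way). A minimal sketch of how they are typically consumed, assuming a single-GPU, single-process run; the defaults here are illustrative, not taken from this repo:

```python
import os

# Read the rendezvous settings that the exports above provide.
# Defaults mirror the single-process values used in this README.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
master_addr = os.environ.get("MASTER_ADDR", "0.0.0.0")
master_port = int(os.environ.get("MASTER_PORT", "2137"))

# With WORLD_SIZE=1 there is exactly one process, so its rank must be 0.
assert 0 <= rank < world_size
```

NCCL_P2P_DISABLE=1 turns off peer-to-peer GPU transfers, which is harmless on a single-GPU machine, and OMP_NUM_THREADS caps CPU thread usage.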
Run the following command:
python laas/main.py
To explore the endpoints, open http://0.0.0.0:8080/docs
To run the fast tests (no LLM loaded):
pytest
To also run the slow tests:
pytest --runslow
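The --runslow flag is the standard custom-option pattern from the pytest documentation. A sketch of the conftest.py hooks that implement it, assuming this repo follows that convention (an assumption, not verified against the code):

```python
import pytest

def pytest_addoption(parser):
    # Register the custom command-line flag.
    parser.addoption("--runslow", action="store_true", default=False,
                     help="run tests marked as slow (e.g. tests that load the LLM)")

def pytest_configure(config):
    # Register the marker so pytest does not warn about it.
    config.addinivalue_line("markers", "slow: mark test as slow to run")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: do not skip anything
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

Tests that load the model would then be decorated with @pytest.mark.slow, so plain pytest skips them and pytest --runslow includes them.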