Simple yet production-ready MLOps example of using any generative model
Demo.mp4
Nowadays, with generative neural networks increasingly becoming a significant part of our lives, many developers have to integrate them into production-ready systems, whether the system is a big enterprise platform or just a wrapper around such a model with a web, mobile, or bot user interface. Programmatic media creation, such as generating social media content like posts, videos, and music, is nothing new by now, but with neural networks becoming so popular and, moreover, affordable for ordinary people, such models have become highly desirable to use.
You can take some model weights and its architecture implementation and just generate content with it by hand; that's not that complex. But when it comes to integrating such a model into a production-ready system, even one as simple as a plain model wrapper, it's no longer that straightforward. You have to think about load balancing, task queues, performance control, durability, robustness, and resource and health monitoring. CI/CD becomes a concern too, since the deployment process grows complex and you don't want to do it by hand every time.
This project uses a Stable Diffusion txt2img model via the diffusers library to demonstrate how to implement a simple production-ready model wrapper, but the idea is that you can use any other generative model, such as:
- txt2img
- img2img
- txt2txt
- tts
- stt
- txt2video
- img2video
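For the txt2img case used here, the wrapping idea can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the rest of the system depends only on a single `generate` function, so swapping in another generative model means swapping the pipeline behind it. The model id and prompt in the comments are placeholders.

```python
# Minimal sketch of hiding a diffusers txt2img pipeline behind one function,
# so the surrounding system never depends on the concrete model.
from typing import Any, Callable

def make_generator(pipe: Any) -> Callable[[str], Any]:
    """Wrap any pipeline that exposes pipe(prompt).images[0]."""
    def generate(prompt: str) -> Any:
        return pipe(prompt).images[0]
    return generate

if __name__ == "__main__":
    # Real usage would look like this (requires a GPU and a model download):
    # from diffusers import StableDiffusionPipeline
    # pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    # image = make_generator(pipe)("a studio photo of a wristwatch")
    # image.save("result.png")
    pass
```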
It also provides a simple demo web interface, shown above, to interact with, but you can use the same architectural approach to implement any other way of interacting: maybe you would like to create a Telegram bot or a native mobile app, or expose it as an external API via REST, RPC, gRPC, or message queues. Feel free to pick whatever suits your case best.
Since the point of the project is to implement a production-ready model wrapper that is as easy to use as possible, the installation process is simple too. It uses Docker, which must be installed to build and run the project:
- Clone the project:

  ```shell
  git clone https://github.com/Dominux/commercial-studio-photos-generator.git
  ```

- Copy `.env.example` to `.env`:

  ```shell
  cp ./.env.example ./.env
  ```

  You can optionally edit it as you wish.

- Build and run the project:

  ```shell
  make up
  ```
Once it's running, you can access http://localhost:8000 to get the webpage.
If something went wrong, or you just want to check the logs, you can look into the logs of the `cspg-server` and `cspg-worker` services:

```shell
docker logs -f cspg-server
docker logs -f cspg-worker
```
It uses the Web-Queue-Worker pattern, for obvious reasons. The whole architecture is presented below:
With scalability being extremely necessary for such applications, this architecture makes it simple to scale:
- incoming traffic, by increasing the number of Server workers
- the number of messages that can be queued, by increasing the Queue capacity
- model throughput, by increasing the number of Worker processes
This architecture also allows distributing the whole system across many machines: moving each service to a different machine, or even increasing generation throughput with multiple GPUs by running a separate Worker process on each GPU unit.
To make such a distributed setup possible, the app provides a centralized service for storing file objects (MinIO) and a cache system (Redis) to speed up checking for finished results, which also improves the system's overall performance.
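The Redis-backed readiness check boils down to two tiny operations, sketched below. The key format and TTL here are illustrative assumptions, not the project's actual scheme: the worker marks a result id as done after uploading the file, and the server checks Redis instead of hitting the object storage every time.

```python
# Sketch of the "is the result ready?" check backed by Redis (redis-py API).
# Key naming and TTL are illustrative assumptions.
RESULT_TTL_SECONDS = 3600

def mark_done(redis_client, result_id: str) -> None:
    # SET with an expiry, so stale entries clean themselves up
    redis_client.set(f"result:{result_id}", "1", ex=RESULT_TTL_SECONDS)

def is_done(redis_client, result_id: str) -> bool:
    # EXISTS returns the number of matching keys (0 or 1 here)
    return redis_client.exists(f"result:{result_id}") > 0
```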
As was said before, to speed up the generation process you need to examine GPU usage first. If the GPU isn't utilized at 100%, increasing the number of parallel generation processes can raise total throughput. Still, it's better to first push a single GPU's utilization up to 100%: that won't change total throughput, but it will speed up each individual generation, which matters when you have only one user at a time. Once the GPU is already at its maximum, adding more workers on it won't bring any acceleration but will dramatically multiply RAM and VRAM usage, and I'm pretty sure no one is interested in such a waste of resources for nothing.
To demonstrate that throughput stays the same as the number of workers grows, I ran a test: 10 parallel requests generating the same product. Even though the workers really did take multiple messages at the same time, each one became proportionally slower. So here's the chart for running this test with 1-4 workers (since I have only 16 GB of RAM and 12 GB of VRAM on my home machine, and renting a VDS with even the same GPU is quite expensive):
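A load test like that amounts to firing the same task N times concurrently and timing the batch. Here is a sketch of such a harness; `task` is a stand-in for the real HTTP request that submits a generation and waits for the result, not an actual call from this project.

```python
# Sketch of the load test: run the same task N times in parallel and measure
# the wall-clock time. `task` stands in for the real generation request.
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def benchmark(task: Callable[[], None], n_parallel: int) -> float:
    """Return wall-clock seconds for n_parallel concurrent runs of task."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        futures = [pool.submit(task) for _ in range(n_parallel)]
        for f in futures:
            f.result()  # re-raise any exception from the task
    return time.monotonic() - start
```

Comparing the elapsed time for the same N while varying the number of workers shows whether extra workers actually help or just slow each other down.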
So, as I said before, in this case performance should be increased by adding GPUs. The direct RabbitMQ exchange I use allows distributing messages from a queue across multiple consumers, which enables exactly this way of accelerating generation.
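With pika, spreading tasks evenly across Workers comes down to setting `prefetch_count=1`, so the broker hands each unacknowledged message to only one consumer at a time. The sketch below illustrates that; the queue name and the handler body are assumptions for illustration, not this project's actual code.

```python
# Sketch of a Worker consuming generation tasks from RabbitMQ via pika.
# prefetch_count=1 gives fair dispatch: each Worker (e.g. one per GPU) holds
# at most one unacknowledged task at a time.

def handle_task(ch, method, properties, body):
    # ... run the generation for `body` and upload the result here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

def consume(queue: str = "generation_tasks") -> None:
    import pika  # external dependency, imported lazily in this sketch
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_qos(prefetch_count=1)  # fair dispatch between Workers
    channel.basic_consume(queue=queue, on_message_callback=handle_task)
    channel.start_consuming()
```

Running one such consumer process per GPU is what lets the queue fan work out across machines.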
- Frontend
  - HTMX - I gave it a try and it performs well, especially in such simple cases
  - Materialize - a CSS library, since I try to avoid using JS alongside HTMX
- Backend
  - Go - its simple, intuitive syntax and performance make it the best choice for quickly creating fast microservices with multiple connections to different systems
  - Python - the language with the easiest path to prototyping ML
  - RabbitMQ - a message broker to store messages and push them to their consumers
  - MinIO - an object storage; I use it for centralized storage of generation results
  - Redis - an in-memory database; I use it to store result IDs so the server can check whether a result is already done
- Model
  - epiCPhotoGazm lastUnicorn - a realistic Stable Diffusion checkpoint
  - easynegative - a simple Stable Diffusion negative embedding
  - DPM++ 2M Karras - a Stable Diffusion scheduler for generating better results
You might expect WebSockets or SSE for that, but I went with good ol' polling (not even long polling), and here's why:
- The generation process doesn't take nanoseconds or even milliseconds; a single task can take seconds or even minutes. So real-time communication isn't needed here at all
- WS and SSE require keeping stateful connections alive, and as parallel users grow and the process slows down, that starts to eat RAM and CPU wildly. So in terms of handling as many simultaneous users as possible, a simple polling strategy is the best choice: it's easy to adjust the interval, and the user stops polling the server once they get the result. WS and SSE are rather for more complex cases after all
Still, that's only my choice, and you are free to pick whatever way you wish; after all, it wasn't the project's main point.
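The client-side polling described above can be sketched like this. The interval and timeout values are illustrative, and `check` stands in for the real HTTP call (something like a GET on a result id returning ready-or-not):

```python
# Sketch of the polling loop: ask whether the result is ready, sleep, repeat,
# and stop as soon as it is done or the deadline passes.
import time
from typing import Callable

def poll_until_done(check: Callable[[], bool],
                    interval: float = 2.0,
                    timeout: float = 300.0) -> bool:
    """Return True once check() succeeds, False if timeout elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Tuning `interval` is the knob mentioned above: a longer interval trades result latency for fewer requests hitting the server.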