Excessive unloading of models when loading an additional model #455

Open
Gaddy-BL opened this issue Oct 30, 2023 · 3 comments
@Gaddy-BL

Issue Description

We see mm container logs in which a thread that is loading a model (here model-load-5e5db6cc) triggers evacuation of all (or most) of the loaded models.
The evacuations are all triggered in the same millisecond.
The evacuation triggers are followed by a warning log:

Entire cache capacity of 1835008 units (14336MiB) is now taken up by removed models that are still unloading
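
For reference, the two figures in this warning let us work out the cache unit size and how many models of our size should fit. The sketch below is only arithmetic on the logged numbers; the class and method names are ours (hypothetical, not taken from ModelMesh), and the 1 GiB model size is an assumption based on our models being roughly "1G".

```java
// Sketch: derive the cache unit size implied by the warning log line.
// 1835008 and 14336 come from the log; the 1 GiB model size is an assumption.
// CacheUnits is a hypothetical helper, not part of ModelMesh.
final class CacheUnits {
    static final long BYTES_PER_MIB = 1024L * 1024L;

    /** Bytes per cache unit implied by a (units, MiB) pair from the log. */
    static long unitSizeBytes(long capacityUnits, long capacityMiB) {
        return capacityMiB * BYTES_PER_MIB / capacityUnits;
    }

    /** Units needed to hold a model of the given size, rounded up. */
    static long unitsFor(long modelBytes, long unitBytes) {
        return (modelBytes + unitBytes - 1) / unitBytes;
    }

    public static void main(String[] args) {
        long capacityUnits = 1835008L;
        long capacityMiB = 14336L;
        long unitBytes = unitSizeBytes(capacityUnits, capacityMiB);
        long modelUnits = unitsFor(1024L * BYTES_PER_MIB, unitBytes);
        System.out.println("unit size (bytes): " + unitBytes);
        System.out.println("units per 1 GiB model: " + modelUnits);
        System.out.println("1 GiB models that fit: " + capacityUnits / modelUnits);
    }
}
```

Under these assumptions a unit is 8 KiB, and a cache of 1835008 units holds 14 models of 1 GiB each, so the warning means that the equivalent of all 14 slots was occupied by models still being unloaded.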

The model that we load is 1G in size, the same as the models already loaded, so it should not require unloading so many loaded models.

We are trying to follow the code in ModelMesh.java between the line that sets the thread name (curThread.setName("model-load-" + modelId)) and the log that reports that we are starting to load the model (logger.info("Starting load for model " + modelId + " type=" + modelType)) to understand what triggers the loaded models evacuation.

We'd like to know how ModelMesh decided that it should evacuate so many models, and where in the code this happens.

To Reproduce

We don't have a reliable way to reproduce this issue, but it happens quite often in our cluster.
The issue seems to happen when GPU memory already holds the maximum number of models it can carry and we then try to load an additional model.

Expected behavior

At most one or two models should be unloaded if space is required to load an additional model with the same characteristics as the loaded models.
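
To make this expectation concrete, here is a minimal sketch of the behaviour we had in mind: a size-bounded cache that evicts least-recently-used entries only until the new model fits. This is not ModelMesh's actual eviction code, which we have not been able to trace; the class `LruEvictionSketch`, the model ids, and the 131072-unit model size (1 GiB at 8 KiB per unit) are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the expected behaviour, not ModelMesh code.
final class LruEvictionSketch {
    private final long capacityUnits;
    private long usedUnits = 0;
    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Long> models =
            new LinkedHashMap<>(16, 0.75f, true);

    LruEvictionSketch(long capacityUnits) {
        this.capacityUnits = capacityUnits;
    }

    /** Adds a model, evicting LRU entries only until it fits; returns evicted ids. */
    List<String> load(String modelId, long sizeUnits) {
        if (sizeUnits > capacityUnits) {
            throw new IllegalArgumentException("model larger than cache");
        }
        // Reloading an existing id replaces it rather than double counting.
        Long previous = models.remove(modelId);
        if (previous != null) {
            usedUnits -= previous;
        }
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = models.entrySet().iterator();
        while (usedUnits + sizeUnits > capacityUnits && it.hasNext()) {
            Map.Entry<String, Long> lru = it.next();
            usedUnits -= lru.getValue();
            evicted.add(lru.getKey());
            it.remove();
        }
        models.put(modelId, sizeUnits);
        usedUnits += sizeUnits;
        return evicted;
    }

    public static void main(String[] args) {
        LruEvictionSketch cache = new LruEvictionSketch(1835008L);
        for (int i = 0; i < 14; i++) {
            cache.load("model-" + i, 131072L); // fills the cache exactly
        }
        // Loading one more same-sized model should evict a single entry.
        System.out.println("evicted: " + cache.load("model-14", 131072L));
    }
}
```

With a full cache of 14 equally sized models, loading a 15th in this sketch evicts exactly one model (the least recently used), which is the order of magnitude we expected to see in the logs.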

Screenshots

The Kibana logs
image

Environment:

We are using version 0.11.0, running on a g4dn.xlarge instance.

Gaddy-BL added the bug label on Oct 30, 2023
ckadner commented Oct 31, 2023

@njhill -- can you provide some of your findings on this one?

Gaddy-BL commented Nov 6, 2023

@ckadner - do you know if someone is looking into this?
Perhaps @njhill ?

@BenHaItay

@njhill @ckadner do you have any idea what the cause might be? We might be able to investigate it on our end as well.
