Issue Description
We see mm container logs where a thread (here model-load-5e5db6cc) that is loading a model triggers evacuation of all (or most of) the loaded models.
The evacuations are all triggered in the same millisecond.
The evacuation triggers are followed by a warning log:
Entire cache capacity of 1835008 units (14336MiB) is now taken up by removed models that are still unloading
The model that we load is about 1G in size, the same as the already loaded models - it should not require unloading so many of them.
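For what it's worth, our reading of the numbers in that warning (assuming each cache capacity unit is 8 KiB - an assumption on our part, inferred from the ratio of the two figures in the log, not taken from the code) is:

```java
// Assumption: one cache capacity unit = 8 KiB (inferred from the warning message, not from the code)
long units = 1_835_008L;
long capacityMiB = units * 8 / 1024;   // = 14336 MiB, i.e. ~14 GiB
// At roughly 1 GiB per model, that leaves room for about 14 resident models,
// which matches our observation that the GPU is "full" just before the failing load.
```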
We are trying to follow the code in ModelMesh.java between the line that sets the thread name (curThread.setName("model-load-" + modelId)) and the log that reports that the model load is starting (logger.info("Starting load for model " + modelId + " type=" + modelType)) to understand what triggers the evacuation of the loaded models.
We'd like to know how modelmesh decided that it should evacuate so many models and where this happens in the code.
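To be clear about our current (unverified) hypothesis: if the cache is weight-bounded and a newly loading model is registered with a predicted weight before its real size is known, an LRU-style cache would evict as many resident entries as needed, in one pass, to make that predicted weight fit. The sketch below is only an illustration of that mechanism - the class and method names are ours, not modelmesh's - but it would explain a burst of evacuations within a single millisecond:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a weight-bounded LRU cache. Names are hypothetical and
// do not correspond to modelmesh's actual classes or fields.
class WeightedLruCache<K> {
    private final long capacityUnits;      // e.g. 1_835_008 units of 8 KiB
    private long usedUnits;
    private final LinkedHashMap<K, Long> entries =
            new LinkedHashMap<>(16, 0.75f, true);   // access-order => LRU

    WeightedLruCache(long capacityUnits) {
        this.capacityUnits = capacityUnits;
    }

    // Insert an entry with a (possibly predicted) weight, evicting
    // least-recently-used entries until it fits.
    synchronized void put(K key, long weightUnits) {
        Iterator<Map.Entry<K, Long>> it = entries.entrySet().iterator();
        while (usedUnits + weightUnits > capacityUnits && it.hasNext()) {
            Map.Entry<K, Long> eldest = it.next();
            usedUnits -= eldest.getValue();
            it.remove();                    // one "evacuation" of a resident model
            System.out.println("evicting " + eldest.getKey());
        }
        entries.put(key, weightUnits);
        usedUnits += weightUnits;
    }
}
```

In a scheme like this, a ~1 GiB model with an accurate weight (~131,072 of those 8 KiB units) would displace at most one or two resident models, whereas a predicted weight close to the full capacity would evict nearly everything in a single put() call. We don't know whether anything like this is what actually happens - pointers to the relevant code path in ModelMesh.java would be much appreciated.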
To Reproduce
We don't have a reproducible way to trigger this issue, but it happens quite often in our cluster.
The issue seems to happen when GPU memory already holds the maximum number of models it can carry and we then try to load an additional model.
Expected behavior
At most one or two models should be unloaded if space is required to load an additional model with the same characteristics as the already loaded models.
Screenshots
The Kibana logs
Environment:
We are using version 0.11.0 and run on a g4dn.xlarge instance.