[FEATURE] Add retry functionality for DeployModelStep and return a model_id #238

owaiskazi19 · 2023-12-01T23:36:57Z

Is your feature request related to a problem?

Currently, we are returning modelId only from RegisterLocalModelStep

flow-framework/src/main/java/org/opensearch/flowframework/workflow/RegisterLocalModelStep.java

Line 249 in aba7dea

Map.entry(MODEL_ID, response.getModelId()),

and RegisterRemoteModelStep

flow-framework/src/main/java/org/opensearch/flowframework/workflow/RegisterRemoteModelStep.java

Line 75 in aba7dea

Map.entry(MODEL_ID, mlRegisterModelResponse.getModelId()),

However, other workflow steps that require a modelId, such as RegisterAgent or Tools, should fetch it only after the model has been fully deployed, and should use the modelId returned by the deploy step.

What solution would you like?

This boils down to adding a retry capability to DeployModelStep: wait for the model deployment status to be COMPLETED, then return the modelId in the WorkflowData.
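
A minimal sketch of the polling idea described above, not the flow-framework implementation. The class name, the TaskState enum, the waitForDeployment helper, and its maxRetries and delayMillis parameters are all hypothetical; the status lookup is passed in as a Supplier so the loop can be exercised without a cluster.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a retry loop for a deploy step; names are illustrative only.
class DeployRetrySketch {

    // Illustrative task states; the states reported by the ML plugin may differ.
    enum TaskState { CREATED, RUNNING, COMPLETED, FAILED }

    // Polls pollState until it reports COMPLETED, then returns the model_id as step output.
    // Gives up after maxRetries additional attempts, or immediately on FAILED.
    static Map<String, String> waitForDeployment(
            String modelId, Supplier<TaskState> pollState, int maxRetries, long delayMillis) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            TaskState state = pollState.get();
            if (state == TaskState.COMPLETED) {
                return Map.of("model_id", modelId);
            }
            if (state == TaskState.FAILED) {
                throw new IllegalStateException("Deployment failed for model " + modelId);
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(delayMillis); // wait before checking the status again
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for model " + modelId, e);
                }
            }
        }
        throw new IllegalStateException(
            "Model " + modelId + " was not deployed after " + maxRetries + " retries");
    }

    public static void main(String[] args) {
        // Simulated status sequence standing in for real task lookups.
        Iterator<TaskState> states =
            List.of(TaskState.CREATED, TaskState.RUNNING, TaskState.COMPLETED).iterator();
        Map<String, String> output = waitForDeployment("example-model-id", states::next, 5, 10L);
        System.out.println(output);
    }
}
```

A blocking loop keeps the sketch short. A real workflow step would more likely schedule the status checks asynchronously rather than sleep on a thread.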

What alternatives have you considered?

Add a ?deploy=true flag for RegisterModel steps.

owaiskazi19 · 2023-12-04T00:52:16Z

Steps to test:

Enable flow framework APIs

curl -i -XPUT "localhost:9200/_cluster/settings" -H "Content-Type:application/json" --data '{"transient":{"plugins.flow_framework.enabled":true}}'

HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 99

{"acknowledged":true,"persistent":{},"transient":{"plugins":{"flow_framework":{"enabled":"true"}}}}

Enable model registration via URL and model deployment on non-ML nodes

curl -i -XPUT "localhost:9200/_cluster/settings" -H "Content-Type:application/json" --data '{"persistent":{"plugins.ml_commons.allow_registering_model_via_url":true,"plugins.ml_commons.only_run_on_ml_node":false}}'
HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 149

{"acknowledged":true,"persistent":{"plugins":{"ml_commons":{"only_run_on_ml_node":"false","allow_registering_model_via_url":"true"}}},"transient":{}}

Create a workflow with the following 3-step template

curl -i -XPOST "localhost:9200/_plugins/_flow_framework/workflow" -H "Content-Type:application/json" --data '{"name":"registermodelgroup-registerlocalmodel-deploymodel","description":"test case","use_case":"TEST_CASE","version":{"template":"1.0.0","compatibility":["2.12.0","3.0.0"]},"workflows":{"provision":{"nodes":[{"id":"workflow_step_1","type":"register_model_group","user_inputs":{"name":"my-model-group-3"}},{"id":"workflow_step_2","type":"register_local_model","previous_node_inputs":{"workflow_step_1":"model_group_id"},"user_inputs":{"node_timeout":"60s","name":"all-MiniLM-L6-v2","version":"1.0.0","description":"test model","model_format":"TORCH_SCRIPT","model_content_hash_value":"c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f","model_type":"bert","embedding_dimension":"384","framework_type":"sentence_transformers","all_config":"{\"_name_or_path\":\"nreimers/MiniLM-L6-H384-uncased\",\"architectures\":[\"BertModel\"],\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}","url":"https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"}},{"id":"workflow_step_3","type":"deploy_model","previous_node_inputs":{"workflow_step_2":"model_id"}}],"edges":[{"source":"workflow_step_1","dest":"workflow_step_2"},{"source":"workflow_step_2","dest":"workflow_step_3"}]}}}'
HTTP/1.1 100 Continue

HTTP/1.1 201 Created
content-type: application/json; charset=UTF-8
content-length: 38

{"workflow_id":"GOZKMowB5L1fWsD0wWJY"}

Provision the workflow; the logs show that all steps completed successfully

curl -i -XPOST "localhost:9200/_plugins/_flow_framework/workflow/GOZKMowB5L1fWsD0wWJY/_provision"
HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 38

{"workflow_id":"GOZKMowB5L1fWsD0wWJY"}

Logs:

:58,300][INFO ][o.o.f.w.ModelGroupStep   ] [ip-172-31-56-214] Model group registration successful
[2023-12-04T00:46:58,413][INFO ][o.o.f.i.FlowFrameworkIndicesHandler] [ip-172-31-56-214] updated resources created of GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:46:58,413][INFO ][o.o.f.w.ModelGroupStep   ] [ip-172-31-56-214] successfully updated resources created in state index: .plugins-workflow-state
[2023-12-04T00:46:58,414][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Finished workflow_step_1.
[2023-12-04T00:46:58,413][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Starting workflow_step_2.
[2023-12-04T00:46:58,433][INFO ][o.o.f.w.RegisterLocalModelStep] [ip-172-31-56-214] Local Model registration task creation successful
[2023-12-04T00:46:58,480][INFO ][o.o.m.m.MLModelManager   ] [ip-172-31-56-214] create new model meta doc G-ZKMowB5L1fWsD05WKj for register model task GuZKMowB5L1fWsD05WJ3
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Defaulting to no-operation (NOP) logger implementation
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Downloading: 100% |████████████████████████████████████████| all-MiniLM-L6-v2.zip
[2023-12-04T00:47:01,984][INFO ][o.o.m.m.MLModelManager   ] [ip-172-31-56-214] Model registered successfully, model id: G-ZKMowB5L1fWsD05WKj, task id: GuZKMowB5L1fWsD05WJ3
[2023-12-04T00:47:03,450][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] Local model registration successful for GOZKMowB5L1fWsD0wWJY and modelId G-ZKMowB5L1fWsD05WKj
[2023-12-04T00:47:03,465][INFO ][o.o.f.i.FlowFrameworkIndicesHandler] [ip-172-31-56-214] updated resources created of GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:47:03,465][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] successfully updated resources created in state index: .plugins-workflow-state
[2023-12-04T00:47:03,465][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Finished workflow_step_2.
[2023-12-04T00:47:03,465][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Starting workflow_step_3.
[2023-12-04T00:47:03,470][INFO ][o.o.m.a.d.TransportDeployModelAction] [ip-172-31-56-214] Will deploy model on these nodes: OkZ4g5z7S_umhu691dXt4w
[2023-12-04T00:47:03,480][INFO ][o.o.f.w.DeployModelStep  ] [ip-172-31-56-214] Model deployment state CREATED
[2023-12-04T00:47:05,739][INFO ][o.o.m.e.a.DLModel        ] [ip-172-31-56-214] Model G-ZKMowB5L1fWsD05WKj is successfully deployed on 1 devices
[2023-12-04T00:47:05,754][INFO ][o.o.m.a.f.TransportForwardAction] [ip-172-31-56-214] deploy model done with state: DEPLOYED, model id: G-ZKMowB5L1fWsD05WKj
[2023-12-04T00:47:05,755][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [ip-172-31-56-214] deploy model task done HOZKMowB5L1fWsD0-WIw
[2023-12-04T00:47:08,483][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] Deploy model successful for GOZKMowB5L1fWsD0wWJY and modelId G-ZKMowB5L1fWsD05WKj
[2023-12-04T00:47:08,496][INFO ][o.o.f.i.FlowFrameworkIndicesHandler] [ip-172-31-56-214] updated resources created of GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:47:08,496][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] successfully updated resources created in state index: .plugins-workflow-state
[2023-12-04T00:47:08,497][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] Provisioning completed successfully for workflow GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:47:08,497][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Finished workflow_step_3.
[2023-12-04T00:47:08,509][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] updated workflow GOZKMowB5L1fWsD0wWJY state to COMPLETED

owaiskazi19 · 2023-12-04T00:54:38Z

@dbwiddis I got SLF4J warnings in the log. I think you fixed this before, but I'm not sure.

[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Defaulting to no-operation (NOP) logger implementation
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

dbwiddis · 2023-12-04T01:10:54Z

@dbwiddis I got SLF4J warnings in the log. I think you fixed this before, but I'm not sure.

This happens when slf4j-api is pulled in somewhere in the dependency hierarchy without a logging binding. SLF4J then falls back to its no-operation (NOP) logger, which means nothing logged through SLF4J by that dependency will appear. See the site referred to in the warning: https://www.slf4j.org/codes.html#StaticLoggerBinder

You can:

Ignore the log error message
Manually install the no-op binding (slf4j-nop) to make the warning go away while preserving the no-op behavior
Install the logging implementation that redirects slf4j logging to log4j2, to get logging for that dependency
Exclude the transitive slf4j dependency, although I'm not sure what happens if you invoke a logging call in the upstream
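
To help decide between the options above, a small diagnostic can check whether the binder class named in the warning is visible on the classpath. This is a sketch under assumptions: the class name is taken from the warning text (SLF4J 1.x style binding), the Slf4jBinderCheck class and isOnClasspath helper are hypothetical, and inside a plugin the relevant class loader may differ from the one used here.

```java
// Hypothetical diagnostic: reports whether a class is visible to this class's loader.
class Slf4jBinderCheck {

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, Slf4jBinderCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the SLF4J warning in the log above.
        String binder = "org.slf4j.impl.StaticLoggerBinder";
        System.out.println(binder + " present: " + isOnClasspath(binder));
    }
}
```

If the binder is absent, the warning is expected and SLF4J output from that dependency is silently dropped.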

owaiskazi19 added the enhancement (New feature or request) and untriaged labels on Dec 1, 2023

owaiskazi19 mentioned this issue Dec 4, 2023

[Feature/agent_framework] Added Retry functionality for Deploy Model #245

Merged

owaiskazi19 self-assigned this Dec 4, 2023

minalsha removed the untriaged label Dec 4, 2023

owaiskazi19 closed this as completed Dec 5, 2023
