[FEATURE] Add retry functionality for DeployModelStep and return a model_id #238

Closed
owaiskazi19 opened this issue Dec 1, 2023 · 3 comments
Labels: enhancement (New feature or request)

Comments

@owaiskazi19

Is your feature request related to a problem?

Currently, only RegisterLocalModelStep and RegisterRemoteModelStep return a modelId. In RegisterRemoteModelStep:
Map.entry(MODEL_ID, mlRegisterModelResponse.getModelId()),

However, other steps that require a modelId, such as RegisterAgent, Tools, or any other workflow step, should wait until the model is fully deployed and then use the modelId that is returned.

What solution would you like?

In short: add a retry capability to DeployModelStep so that it waits for the model deployment status to be COMPLETED and then returns the modelId in the WorkflowData.
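The requested behavior can be illustrated with a minimal polling sketch. Everything named here is hypothetical and only for illustration: `TaskState`, `TaskStatus`, `awaitDeployment`, and the `getTask` lookup are stand-ins, not the actual ML Commons or Flow Framework API, and the real step may well be asynchronous rather than blocking.

```java
import java.util.Map;
import java.util.function.Function;

public class DeployRetrySketch {

    /** Hypothetical task states; the real ML Commons task states may differ. */
    enum TaskState { CREATED, RUNNING, COMPLETED, FAILED }

    /** Hypothetical snapshot of a deploy task: its state and the model it deploys. */
    record TaskStatus(TaskState state, String modelId) {}

    /**
     * Polls the task until it reports COMPLETED, then returns the model ID
     * as a map entry that a later workflow step could consume.
     */
    static Map.Entry<String, String> awaitDeployment(
            String taskId,
            Function<String, TaskStatus> getTask, // stand-in for a "get task" call
            int maxRetries,
            long sleepMillis) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            TaskStatus status = getTask.apply(taskId);
            if (status.state() == TaskState.COMPLETED) {
                return Map.entry("model_id", status.modelId());
            }
            if (status.state() == TaskState.FAILED) {
                throw new IllegalStateException("Deployment failed for task " + taskId);
            }
            try {
                Thread.sleep(sleepMillis); // wait before polling again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for task " + taskId, e);
            }
        }
        throw new IllegalStateException(
            "Deployment did not complete after " + maxRetries + " retries");
    }

    public static void main(String[] args) {
        // Fake task lookup that reports COMPLETED on the third poll.
        int[] calls = {0};
        Function<String, TaskStatus> fakeGetTask = id ->
            ++calls[0] < 3
                ? new TaskStatus(TaskState.CREATED, "model-1")
                : new TaskStatus(TaskState.COMPLETED, "model-1");
        System.out.println(awaitDeployment("task-1", fakeGetTask, 5, 10L));
    }
}
```

The key point is that the step only emits `model_id` once the task is COMPLETED, so downstream steps never see the ID of a model that is still deploying.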

What alternatives have you considered?

Add a ?deploy=true flag for RegisterModel steps.

owaiskazi19 added the enhancement and untriaged labels on Dec 1, 2023
@owaiskazi19

Steps to test:

  1. Enable flow framework APIs
curl -i -XPUT "localhost:9200/_cluster/settings" -H "Content-Type:application/json" --data '{"transient":{"plugins.flow_framework.enabled":true}}'

HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 99

{"acknowledged":true,"persistent":{},"transient":{"plugins":{"flow_framework":{"enabled":"true"}}}}
  2. Enable model registration via URL and model deployment on non-ML nodes
curl -i -XPUT "localhost:9200/_cluster/settings" -H "Content-Type:application/json" --data '{"persistent":{"plugins.ml_commons.allow_registering_model_via_url":true,"plugins.ml_commons.only_run_on_ml_node":false}}'
HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 149

{"acknowledged":true,"persistent":{"plugins":{"ml_commons":{"only_run_on_ml_node":"false","allow_registering_model_via_url":"true"}}},"transient":{}}
  3. Create a workflow with the following three-step template
curl -i -XPOST "localhost:9200/_plugins/_flow_framework/workflow" -H "Content-Type:application/json" --data '{"name":"registermodelgroup-registerlocalmodel-deploymodel","description":"test case","use_case":"TEST_CASE","version":{"template":"1.0.0","compatibility":["2.12.0","3.0.0"]},"workflows":{"provision":{"nodes":[{"id":"workflow_step_1","type":"register_model_group","user_inputs":{"name":"my-model-group-3"}},{"id":"workflow_step_2","type":"register_local_model","previous_node_inputs":{"workflow_step_1":"model_group_id"},"user_inputs":{"node_timeout":"60s","name":"all-MiniLM-L6-v2","version":"1.0.0","description":"test model","model_format":"TORCH_SCRIPT","model_content_hash_value":"c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f","model_type":"bert","embedding_dimension":"384","framework_type":"sentence_transformers","all_config":"{\"_name_or_path\":\"nreimers/MiniLM-L6-H384-uncased\",\"architectures\":[\"BertModel\"],\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}","url":"https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"}},{"id":"workflow_step_3","type":"deploy_model","previous_node_inputs":{"workflow_step_2":"model_id"}}],"edges":[{"source":"workflow_step_1","dest":"workflow_step_2"},{"source":"workflow_step_2","dest":"workflow_step_3"}]}}}'
HTTP/1.1 100 Continue

HTTP/1.1 201 Created
content-type: application/json; charset=UTF-8
content-length: 38

{"workflow_id":"GOZKMowB5L1fWsD0wWJY"}
  4. Provision the workflow; the logs show that all steps completed successfully
curl -i -XPOST "localhost:9200/_plugins/_flow_framework/workflow/GOZKMowB5L1fWsD0wWJY/_provision"
HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 38

{"workflow_id":"GOZKMowB5L1fWsD0wWJY"}

Logs:

:58,300][INFO ][o.o.f.w.ModelGroupStep   ] [ip-172-31-56-214] Model group registration successful
[2023-12-04T00:46:58,413][INFO ][o.o.f.i.FlowFrameworkIndicesHandler] [ip-172-31-56-214] updated resources created of GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:46:58,413][INFO ][o.o.f.w.ModelGroupStep   ] [ip-172-31-56-214] successfully updated resources created in state index: .plugins-workflow-state
[2023-12-04T00:46:58,414][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Finished workflow_step_1.
[2023-12-04T00:46:58,413][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Starting workflow_step_2.
[2023-12-04T00:46:58,433][INFO ][o.o.f.w.RegisterLocalModelStep] [ip-172-31-56-214] Local Model registration task creation successful
[2023-12-04T00:46:58,480][INFO ][o.o.m.m.MLModelManager   ] [ip-172-31-56-214] create new model meta doc G-ZKMowB5L1fWsD05WKj for register model task GuZKMowB5L1fWsD05WJ3
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Defaulting to no-operation (NOP) logger implementation
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Downloading: 100% |████████████████████████████████████████| all-MiniLM-L6-v2.zip
[2023-12-04T00:47:01,984][INFO ][o.o.m.m.MLModelManager   ] [ip-172-31-56-214] Model registered successfully, model id: G-ZKMowB5L1fWsD05WKj, task id: GuZKMowB5L1fWsD05WJ3
[2023-12-04T00:47:03,450][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] Local model registration successful for GOZKMowB5L1fWsD0wWJY and modelId G-ZKMowB5L1fWsD05WKj
[2023-12-04T00:47:03,465][INFO ][o.o.f.i.FlowFrameworkIndicesHandler] [ip-172-31-56-214] updated resources created of GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:47:03,465][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] successfully updated resources created in state index: .plugins-workflow-state
[2023-12-04T00:47:03,465][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Finished workflow_step_2.
[2023-12-04T00:47:03,465][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Starting workflow_step_3.
[2023-12-04T00:47:03,470][INFO ][o.o.m.a.d.TransportDeployModelAction] [ip-172-31-56-214] Will deploy model on these nodes: OkZ4g5z7S_umhu691dXt4w
[2023-12-04T00:47:03,480][INFO ][o.o.f.w.DeployModelStep  ] [ip-172-31-56-214] Model deployment state CREATED
[2023-12-04T00:47:05,739][INFO ][o.o.m.e.a.DLModel        ] [ip-172-31-56-214] Model G-ZKMowB5L1fWsD05WKj is successfully deployed on 1 devices
[2023-12-04T00:47:05,754][INFO ][o.o.m.a.f.TransportForwardAction] [ip-172-31-56-214] deploy model done with state: DEPLOYED, model id: G-ZKMowB5L1fWsD05WKj
[2023-12-04T00:47:05,755][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [ip-172-31-56-214] deploy model task done HOZKMowB5L1fWsD0-WIw
[2023-12-04T00:47:08,483][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] Deploy model successful for GOZKMowB5L1fWsD0wWJY and modelId G-ZKMowB5L1fWsD05WKj
[2023-12-04T00:47:08,496][INFO ][o.o.f.i.FlowFrameworkIndicesHandler] [ip-172-31-56-214] updated resources created of GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:47:08,496][INFO ][o.o.f.w.AbstractRetryableWorkflowStep] [ip-172-31-56-214] successfully updated resources created in state index: .plugins-workflow-state
[2023-12-04T00:47:08,497][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] Provisioning completed successfully for workflow GOZKMowB5L1fWsD0wWJY
[2023-12-04T00:47:08,497][INFO ][o.o.f.w.ProcessNode      ] [ip-172-31-56-214] Finished workflow_step_3.
[2023-12-04T00:47:08,509][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [ip-172-31-56-214] updated workflow GOZKMowB5L1fWsD0wWJY state to COMPLETED
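
The logs show the three steps running in the order implied by the template's edges. As a rough sketch of how such an order can be derived from `nodes` and `edges`, here is a plain topological sort (Kahn's algorithm). This is an assumption made for illustration, not a description of how the plugin actually schedules steps, and `WorkflowOrderSketch` and `executionOrder` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WorkflowOrderSketch {

    /** Returns node IDs in an order where every edge's source precedes its dest. */
    static List<String> executionOrder(List<String> nodes, List<String[]> edges) {
        Map<String, Integer> inDegree = new HashMap<>();
        Map<String, List<String>> successors = new HashMap<>();
        for (String node : nodes) {
            inDegree.put(node, 0);
            successors.put(node, new ArrayList<>());
        }
        for (String[] edge : edges) { // edge[0] = source, edge[1] = dest
            successors.get(edge[0]).add(edge[1]);
            inDegree.merge(edge[1], 1, Integer::sum);
        }
        // Start with the nodes that have no predecessors.
        Deque<String> ready = new ArrayDeque<>();
        for (String node : nodes) {
            if (inDegree.get(node) == 0) {
                ready.add(node);
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            // A successor becomes ready once all of its predecessors have run.
            for (String next : successors.get(node)) {
                if (inDegree.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != nodes.size()) {
            throw new IllegalArgumentException("Workflow graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("workflow_step_1", "workflow_step_2", "workflow_step_3");
        List<String[]> edges = List.of(
            new String[] {"workflow_step_1", "workflow_step_2"},
            new String[] {"workflow_step_2", "workflow_step_3"});
        // prints [workflow_step_1, workflow_step_2, workflow_step_3]
        System.out.println(executionOrder(nodes, edges));
    }
}
```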

@owaiskazi19

@dbwiddis got slf4j warnings in the log. I think you fixed this before or not? Not sure.

[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: Defaulting to no-operation (NOP) logger implementation
[2023-12-04T00:46:58,485][WARN ][stderr                   ] [ip-172-31-56-214] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

owaiskazi19 self-assigned this on Dec 4, 2023
@dbwiddis

dbwiddis commented Dec 4, 2023

@dbwiddis got slf4j warnings in the log. I think you fixed this before or not? Not sure.

This happens when slf4j-api is imported somewhere in the dependency hierarchy without a logging binding. SLF4J then falls back to its no-operation (NOP) logger, so nothing is logged from whichever dependency uses slf4j. See the page referenced in the warning: https://www.slf4j.org/codes.html#StaticLoggerBinder

You can:

  1. Ignore the warning
  2. Manually add the slf4j-nop dependency to make the warning go away while preserving the no-op behavior
  3. Install the logging implementation that redirects slf4j logging to log4j2, to get logging for that dependency
  4. Exclude the transitive slf4j dependency, although I'm not sure what happens if upstream code then invokes a logging call
