Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds transport request retry capability for GetMLTaskStep #179

Merged
merged 10 commits into from
Nov 23, 2023

Conversation

joshpalis
Copy link
Member

@joshpalis joshpalis commented Nov 20, 2023

Description

Adds an updateable FlowFrameworkMaxRequestRetry setting to control the number of request retries for workflow steps that invoke an async API.

In this case for a local model registration workflow, the GetMLTaskStep will first attempt to retrieve the status of the given task. If the task is not Completed, then the step will wait before retrying the request until the status is Completed. Only then will the model_id be available

Steps to test :

  1. Enable the flow framework plugin APIs :
curl -i -XPUT "localhost:9200/_cluster/settings" -H "Content-Type:application/json" --data '{"transient":{"plugins.flow_framework.enabled":true}}'
HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 99

{"acknowledged":true,"persistent":{},"transient":{"plugins":{"flow_framework":{"enabled":"true"}}}}%     
  1. Enable local model registration via URL and enable model registration to be handled by non-ml nodes :
curl -i -XPUT "localhost:9200/_cluster/settings" -H "Content-Type:application/json" --data '{"persistent":{"plugins.ml_commons.allow_registering_model_via_url":true,"plugins.ml_commons.only_run_on_ml_node":false}}'

HTTP/1.1 200 OK
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 149

{"acknowledged":true,"persistent":{"plugins":{"ml_commons":{"only_run_on_ml_node":"false","allow_registering_model_via_url":"true"}}},"transient":{}}
  1. Create a workflow (Create Model Group -> register local model -> get ML task -> deploy model ). GetMLTaskStep includes an increased node_timeout due to the length of time it takes to register a local model
curl -i -XPOST "localhost:9200/_plugins/_flow_framework/workflow" -H "Content-Type:application/json" --data '{"name":"registermodelgroup-registerlocalmodel-getmltask-deploymodel","description":"test case","use_case":"TEST_CASE","version":{"template":"1.0.0","compatibility":["2.12.0","3.0.0"]},"workflows":{"provision":{"nodes":[{"id":"workflow_step_1","type":"model_group","user_inputs":{"name":"my-model-group"}},{"id":"workflow_step_2","type":"register_local_model","previous_node_inputs":{"workflow_step_1":"model_group_id"},"user_inputs":{"node_timeout":"60s","name":"all-MiniLM-L6-v2","version":"1.0.0","description":"test model","model_format":"TORCH_SCRIPT","model_content_hash_value":"c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f","model_type":"bert","embedding_dimension":"384","framework_type":"sentence_transformers","all_config":"{\"_name_or_path\":\"nreimers/MiniLM-L6-H384-uncased\",\"architectures\":[\"BertModel\"],\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}","url":"https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"}},{"id":"workflow_step_3","type":"get_ml_task","previous_node_inputs":{"workflow_step_2":"task_id"}},{"id":"workflow_step_4","type":"deploy_model","previous_node_inputs":{"workflow_step_3":"model_id"}}],"edges":[{"source":"workflow_step_1","dest":"workflow_step_2"},{"source":"workflow_step_2","dest":"workflow_step_3"},{"source":"workflow_step_3","dest":"workflow_step_4"}]}}}'
HTTP/1.1 201 Created
X-OpenSearch-Version: OpenSearch/3.0.0-SNAPSHOT (opensearch)
content-type: application/json; charset=UTF-8
content-length: 38

{"workflow_id":"uR-k7osBceShxfTxn2M4"
  1. Provision workflow is triggered, model group creation, local model registration, task retrieval and model deployment completed successfully :
[2023-11-20T21:31:12,646][INFO ][o.o.f.w.ModelGroupStep   ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Model group registration successful
[2023-11-20T21:31:12,648][INFO ][o.o.f.w.ProcessNode      ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Finished workflow_step_1.
[2023-11-20T21:31:12,648][INFO ][o.o.f.w.ProcessNode      ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Starting workflow_step_2.
[2023-11-20T21:31:12,658][INFO ][stdout                   ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] registering the model
[2023-11-20T21:31:12,664][INFO ][o.o.p.PluginsService     ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] PluginService:onIndexModule index:[.plugins-ml-task/3UvftLwYSiOQ3sg5MeUUlw]
[2023-11-20T21:31:12,674][INFO ][o.o.c.m.MetadataCreateIndexService] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] [.plugins-ml-task] creating index, cause [api], templates [], shards [1]/[1]
[2023-11-20T21:31:12,675][INFO ][o.o.c.r.a.AllocationService] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] updating number_of_replicas to [0] for indices [.plugins-ml-task]
[2023-11-20T21:31:12,696][INFO ][o.o.p.PluginsService     ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] PluginService:onIndexModule index:[.plugins-ml-task/3UvftLwYSiOQ3sg5MeUUlw]
[2023-11-20T21:31:12,745][INFO ][o.o.c.r.a.AllocationService] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.plugins-ml-task][0]]]).
[2023-11-20T21:31:12,763][INFO ][o.o.m.i.MLIndicesHandler ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] create index:.plugins-ml-task
[2023-11-20T21:31:12,777][INFO ][o.o.f.w.RegisterLocalModelStep] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Local Model registration task creation successful
[2023-11-20T21:31:12,778][INFO ][o.o.f.w.ProcessNode      ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Finished workflow_step_2.
[2023-11-20T21:31:12,778][INFO ][o.o.f.w.ProcessNode      ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Starting workflow_step_3.
[2023-11-20T21:31:12,811][INFO ][o.o.p.PluginsService     ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] PluginService:onIndexModule index:[.plugins-ml-model/VsGHpDA4QQ2BCwGeXUe86w]
[2023-11-20T21:31:12,820][INFO ][o.o.c.m.MetadataCreateIndexService] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] [.plugins-ml-model] creating index, cause [api], templates [], shards [1]/[1]
[2023-11-20T21:31:12,821][INFO ][o.o.c.r.a.AllocationService] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] updating number_of_replicas to [0] for indices [.plugins-ml-model]
[2023-11-20T21:31:12,840][INFO ][o.o.p.PluginsService     ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] PluginService:onIndexModule index:[.plugins-ml-model/VsGHpDA4QQ2BCwGeXUe86w]
[2023-11-20T21:31:12,884][INFO ][o.o.c.r.a.AllocationService] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.plugins-ml-model][0]]]).
[2023-11-20T21:31:12,903][INFO ][o.o.m.i.MLIndicesHandler ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] create index:.plugins-ml-model
[2023-11-20T21:31:12,921][INFO ][o.o.m.m.MLModelManager   ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] create new model meta doc vB-l7osBceShxfTxAGOJ for register model task ux-k7osBceShxfTx_2P8
Downloading: 100% |████████████████████████████████████████| all-MiniLM-L6-v2.zipc8aa11.us-west-2.amazon.com] 
[2023-11-20T21:31:17,378][INFO ][o.o.m.m.MLModelManager   ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Model registered successfully, model id: vB-l7osBceShxfTxAGOJ, task id: ux-k7osBceShxfTx_2P8
[2023-11-20T21:31:17,787][INFO ][o.o.f.w.GetMLTaskStep    ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] ML Task retrieval successful
[2023-11-20T21:31:17,788][INFO ][o.o.f.w.ProcessNode      ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Finished workflow_step_3.
[2023-11-20T21:31:17,788][INFO ][o.o.f.w.ProcessNode      ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Starting workflow_step_4.
[2023-11-20T21:31:17,796][INFO ][o.o.m.a.d.TransportDeployModelAction] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Will deploy model on these nodes: A-f7kprKSmK3fnxkjh0keA
[2023-11-20T21:31:17,812][INFO ][o.o.f.w.DeployModelStep  ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Model deployment state CREATED
[2023-11-20T21:31:17,813][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Provisioning completed successfully for workflow uR-k7osBceShxfTxn2M4
[2023-11-20T21:31:17,813][INFO ][o.o.f.w.ProcessNode      ] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] Finished workflow_step_4.
[2023-11-20T21:31:17,830][INFO ][o.o.f.t.ProvisionWorkflowTransportAction] [dev-dsk-jpalis-2c-27c8aa11.us-west-2.amazon.com] updated workflow uR-k7osBceShxfTxn2M4 state to COMPLETED

Issues Resolved

Actioning issue : #158
Part of #88

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Nov 20, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (d3ab0ef) 72.12% compared to head (f98404c) 72.24%.

Files Patch % Lines
...ensearch/flowframework/workflow/GetMLTaskStep.java 85.18% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #179      +/-   ##
============================================
+ Coverage     72.12%   72.24%   +0.12%     
- Complexity      407      413       +6     
============================================
  Files            53       54       +1     
  Lines          1948     1971      +23     
  Branches        164      165       +1     
============================================
+ Hits           1405     1424      +19     
- Misses          471      474       +3     
- Partials         72       73       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Joshua Palis <[email protected]>
Copy link
Member

@owaiskazi19 owaiskazi19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see the retry functionality! Much needed. Few comments regarding the implementation.

… to initialize the setting update consumer for max retry request setting, passing settings and clusterservice to retryable workflow step instead

Signed-off-by: Joshua Palis <[email protected]>
Copy link
Member

@owaiskazi19 owaiskazi19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Keeping the comment unresolved for Dan to answer. I'm assuming retry will be tested with integ tests?

…oving max value from setting, cancelling future on thread interuppt

Signed-off-by: Joshua Palis <[email protected]>
Copy link
Member

@dbwiddis dbwiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@dbwiddis dbwiddis merged commit 1092325 into opensearch-project:main Nov 23, 2023
19 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 23, 2023
* Added FlowFrameworkMaxRequestRetrySetting and applied this to GetMlTaskStep

Signed-off-by: Joshua Palis <[email protected]>

* Addressing PR comments, creating abstract class RetryableWorkflowStep to initialize the setting update consumer for max retry request setting, passing settings and clusterservice to retryable workflow step instead

Signed-off-by: Joshua Palis <[email protected]>

* Addressing PR comments

Signed-off-by: Joshua Palis <[email protected]>

* Removing retry utils class

Signed-off-by: Joshua Palis <[email protected]>

* Fixing failure unit tests to ensure thread sleep isnt invoked

Signed-off-by: Joshua Palis <[email protected]>

* Addressing PR comments, changing setting name, improving javadoc, removing max value from setting, cancelling future on thread interuppt

Signed-off-by: Joshua Palis <[email protected]>

---------

Signed-off-by: Joshua Palis <[email protected]>
(cherry picked from commit 1092325)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 28, 2023
* Added FlowFrameworkMaxRequestRetrySetting and applied this to GetMlTaskStep

Signed-off-by: Joshua Palis <[email protected]>

* Addressing PR comments, creating abstract class RetryableWorkflowStep to initialize the setting update consumer for max retry request setting, passing settings and clusterservice to retryable workflow step instead

Signed-off-by: Joshua Palis <[email protected]>

* Addressing PR comments

Signed-off-by: Joshua Palis <[email protected]>

* Removing retry utils class

Signed-off-by: Joshua Palis <[email protected]>

* Fixing failure unit tests to ensure thread sleep isnt invoked

Signed-off-by: Joshua Palis <[email protected]>

* Addressing PR comments, changing setting name, improving javadoc, removing max value from setting, cancelling future on thread interuppt

Signed-off-by: Joshua Palis <[email protected]>

---------

Signed-off-by: Joshua Palis <[email protected]>
(cherry picked from commit 1092325)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
joshpalis pushed a commit that referenced this pull request Nov 28, 2023
…ility for GetMLTaskStep (#205)

Adds transport request retry capability for GetMLTaskStep (#179)

* Added FlowFrameworkMaxRequestRetrySetting and applied this to GetMlTaskStep



* Addressing PR comments, creating abstract class RetryableWorkflowStep to initialize the setting update consumer for max retry request setting, passing settings and clusterservice to retryable workflow step instead



* Addressing PR comments



* Removing retry utils class



* Fixing failure unit tests to ensure thread sleep isnt invoked



* Addressing PR comments, changing setting name, improving javadoc, removing max value from setting, cancelling future on thread interuppt



---------


(cherry picked from commit 1092325)

Signed-off-by: Joshua Palis <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants