Task definition caching does not work if the task definitions come from separate deployments. #368

taylor-curran · 2024-01-16T22:31:38Z

Expectation / Proposal

We just moved from defining our own task-definition for prefect-ecs-task, to letting prefect defining them by itself in runtime.

After the change, we seem to have an issue where prefect defines a task_definition per task. Error from aws:

botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.

Task-definition diff is:

taskDefinitionArn , the revision-number is changed
revision , the revision-number is changed
registeredAt , the timestamp

The main think I see I can change in the work_pool-config regarding this, is: Auto Deregister Task Definition (Optional) , we have tried with this both enabled and disabled.

When doing a simple test like, where the same deployment is deployed multiple times, it runs ok, and same task_definition is used, both with and with out Auto Deregister Task Definition (Optional):

#!/usr/bin/env bash

_run_prefect_task() {
    echo "Trigger task: $1"
    poetry run prefect deployment run 'ecs-task-flow/snowflake_access_test'
    echo "Task done: $1"
}

_run_task_in_parallel() {
    CONCURRENT_RUNS="$1"
    for i in $(seq 1 "$CONCURRENT_RUNS"); do
        _run_prefect_task $i &
    done
    echo "Waiting for tasks to complete"
    wait
}

_run_task_in_parallel 200

Traceback / Example

I would like to help contribute a pull request to resolve this!

In the case where we are being throttled, we have different deployments using the same image/image-tag, and then could also use the same task-definition.

From looking at: prefect_aws/workers/ecs_worker.py#L735 You can see that we cache the task definition to avoid the over-registering of task definitions, but for this user in particular, they have too many different deployments.

We might have a situation, where what is used as cache-key is the deployment-id, so that results in the same task-definition is added to cache multiple times, and then being created multiple times. Maybe we could use a hash from the task-definition as cache-key, together with some other values from the worker/deployment, but not the deployment-id?

The text was updated successfully, but these errors were encountered:

mch-sb · 2024-01-17T08:45:47Z

A small example of when this issue presents an issue for us:

from prefect import flow, deploy
import asyncio
from prefect.deployments.deployments import run_deployment

@flow(log_prints=True)
def hello(deployment_index):
    print(f'hello {deployment_index}')


async def main():
    deployments=[]
    for x in range(200):
        deployments.append(
            await hello.to_deployment(
                name=f"deplyment-{x}",
                parameters=dict(
                    deployment_index=x,
                ),
            )
        )
    deployed_deployments = await deploy(
        *deployments,
        image="same image for all deployments",
        work_pool_name="same workpool for all deployments",
        build=False,
    )
    for uuid in deployed_deployments:
        await run_deployment(
            name=uuid,
            timeout=0,
        )

if __name__ == "__main__":
    asyncio.run(main())

Basically, a set of deployments, different parameters, but same image and work_pool, which results in one task_definition revision created per deployment, and we get the ClientException.

mbgrabbe · 2024-01-24T15:04:34Z

I ran into this very same issue yesterday while trying out ECS worker pool deployments. This is not a problem we see today with our infra block agent deployments.

kevingrismore mentioned this issue Jan 22, 2024

Add ECS worker option to use most recent revision in task definition family #370

Merged

5 tasks

kevingrismore closed this as completed in #370 Feb 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Task definition caching does not work if the task definitions come from separate deployments. #368

Task definition caching does not work if the task definitions come from separate deployments. #368

taylor-curran commented Jan 16, 2024

mch-sb commented Jan 17, 2024

mbgrabbe commented Jan 24, 2024

Task definition caching does not work if the task definitions come from separate deployments. #368

Task definition caching does not work if the task definitions come from separate deployments. #368

Comments

taylor-curran commented Jan 16, 2024

Expectation / Proposal

Traceback / Example

mch-sb commented Jan 17, 2024

mbgrabbe commented Jan 24, 2024