Adding mmpose #17

antoinefalisse · 2024-06-06T00:03:02Z

@sashasimkin @suhlrich @olehkorkh-planeks I got the ASG working, at least partly. Here are the remaining TODOs/bugs:

MMPOSE is not set in task definition
- For the core algorithm to run, we need three containers: OpenCap, OpenPose, and MMPose. The existing task definition does not include MMPose, so I added it through this PR, basically copying what we do for OpenPose (FYI, OpenPose and MMPose are two containers doing essentially the same thing with different models). However, the tasks are not starting in the processing cluster: after more than 15 min they are still pending/provisioning. I wondered whether it was a memory problem, because the mmpose image on ECR is 8931.13 MB and the three images combined are over 15 GB. I tried using a g5.2xlarge and doubling the memory, but I had the same issue (we also have a g5.xlarge on EC2 running the core algorithm with no memory problem, so I really don't think memory is the issue). Do you see something missing in my PR that could explain why the tasks do not start?
ECS_CONTAINER_METADATA_FILE is not set
- The instance is not recognized as an ECS instance. I temporarily commented out most of the function and returned True so I could keep testing, but with the original code the check failed, meaning ECS_CONTAINER_METADATA_FILE was not set. ChatGPT suggested making this change in the task_definition, but it did not help. Any thoughts on what is going wrong?
Two tasks start at once, shouldn't it be one by one?
When two tasks are active, they never stop even if the instances are unprotected.
- Here are the logs from my tests. Tasks d6a and a3d were unprotected for 60min while the # of pending trials was < the target, but the tasks never stopped. It is unclear why a 3rd task started, the max size should be 2. Interestingly, when 3 tasks were active, one (cda) actually stopped after being unprotected. My overall impression is that there is something wrong somewhere regarding the max/min/desired number of tasks. 1) It starts two at the same time, which is not expected. 2) It goes higher than 2 although the max size is set to 2. 3) It does not stop tasks when 2 are running, but it does when 3 are running.
  - 13:22 - Task cda started
  - 13:22 - Task d6a started
  - 13:29 - Task cda: Instance unprotected
  - 13:30 - Task 21a started
    - Why? Max size should be 2 and no tasks stopped yet
    - It seems to have started right after the instance of task cda got unprotected.
  - 13:32 - Task d6a: Instance unprotected
  - 13:36 - Task 21a crashed
    - Status check failed (cp: cannot stat ‘/data/output_openpose/*’: No such file or directory)
    - The task seems to have stopped by itself, but not reported under events
    - That would be a nice feature if that happened
  - 13:37 - Task a3d started
    - Why? Max size should be 2.
    - It seems to have started right after the instance of task 21a crashed
  - 13:38 - Task a3d: Instance unprotected
  - 13:38 - Task cda stopped
  - 14:32 - Task d6a woke up, which is exactly 1h after it was “put to sleep”
  - 14:38 - Task a3d woke up, which is exactly 1h after it was “put to sleep”
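
To double-check the "exactly 1h" observation, here is a small sketch that computes the gap between the "Instance unprotected" and "woke up" timestamps from the log above. The dictionary layout and the `minutes_between` helper are just for illustration, not part of opencap-core.

```python
from datetime import datetime

FMT = "%H:%M"

# (unprotected, woke up) timestamps copied from the log above.
sleep_windows = {
    "d6a": ("13:32", "14:32"),
    "a3d": ("13:38", "14:38"),
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Minutes each task stayed "asleep" after its instance was unprotected.
sleep_minutes = {task: minutes_between(*window) for task, window in sleep_windows.items()}

# For comparison: task cda was unprotected at 13:29 and stopped at 13:38.
cda_minutes_until_stop = minutes_between("13:29", "13:38")
```

Both d6a and a3d come out at 60 minutes, while cda stopped well within that window, which matches the impression that only the third task was ever scaled in.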
Need to unprotect instance if something makes core crash (Antoine)
- Eg, here
- We also have this API error sometimes. Let's make sure we capture it and unprotect the instance. I am pretty sure that right now it just crashes and the trial stays in "processing" status.
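
A minimal sketch of the unprotect-on-crash idea, to make the intent concrete. All three callables (`process_trial`, `report_error`, `unprotect_instance`) are hypothetical placeholders, not existing opencap-core functions. In a real worker, `unprotect_instance` could for example wrap boto3's Auto Scaling `set_instance_protection(..., ProtectedFromScaleIn=False)` call, and `report_error` could set the trial status to an error state, but that wiring is an assumption.

```python
from typing import Callable


def run_trial_safely(
    process_trial: Callable[[], None],
    report_error: Callable[[Exception], None],
    unprotect_instance: Callable[[], None],
) -> bool:
    """Run one trial; if it raises, report the error and lift scale-in protection.

    Returns True when the trial finished, False when it crashed.
    """
    try:
        process_trial()
    except Exception as exc:
        try:
            # So the trial does not stay in "processing" status forever.
            report_error(exc)
        finally:
            # Unprotect even if reporting the error itself fails.
            unprotect_instance()
        return False
    return True


# Usage example with stand-in callables that only record what happened.
events = []


def failing_trial() -> None:
    raise RuntimeError("API error")


ok = run_trial_safely(
    failing_trial,
    lambda exc: events.append(f"error: {exc}"),
    lambda: events.append("unprotected"),
)
```

The nested `try`/`finally` is there so that a failure while reporting the error (e.g. the same API being down) still leaves the instance unprotected.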

sashasimkin · 2024-06-07T08:58:00Z

Hi @antoinefalisse !

> However, the tasks are not starting in the processing cluster.

Is this still an issue? I see that later messages specify that the worker has actually started and processed stuff.
Were later messages from tests without MMPOSE?

In general, I have seen this behavior when the container instance didn't have enough memory to place the task.
Most likely, running another container increases the runtime memory used by Docker itself.

In any case, I've improved it here, to make this behavior more predictable and adjustable: e802650
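
To illustrate the placement reasoning, here is a rough sketch of the kind of check the scheduler effectively performs. It is not the actual ECS algorithm. The per-container memory values are made up for illustration, 768 is the ECS_RESERVED_MEMORY value (in MiB) from e802650, and the GPU part is an assumption on my side: as far as I understand, ECS pins each requested GPU to one container, and g5.xlarge and g5.2xlarge both have a single GPU. Note that placement depends on memory/GPU reservations, not on the image size on ECR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    name: str
    memory_mib: int  # memory reserved for the container in the task definition
    gpus: int = 0    # GPUs requested by the container


def placement_blockers(
    instance_memory_mib: int,
    reserved_mib: int,
    instance_gpus: int,
    containers: List[Container],
) -> List[str]:
    """Return the reasons a task would not fit on an otherwise empty instance."""
    blockers = []
    available = instance_memory_mib - reserved_mib
    needed = sum(c.memory_mib for c in containers)
    if needed > available:
        blockers.append(f"memory: need {needed} MiB, have {available} MiB")
    gpus_needed = sum(c.gpus for c in containers)
    if gpus_needed > instance_gpus:
        blockers.append(f"gpu: need {gpus_needed}, have {instance_gpus}")
    return blockers


# Hypothetical reservations; the real values live in the task definition.
task = [
    Container("opencap", 4096),
    Container("openpose", 4096, gpus=1),
    Container("mmpose", 4096, gpus=1),
]

# Nominal memory: 16 GiB for g5.xlarge, 32 GiB for g5.2xlarge.
# The agent registers slightly less than the nominal amount in practice.
on_xlarge = placement_blockers(16 * 1024, 768, 1, task)
on_2xlarge = placement_blockers(32 * 1024, 768, 1, task)
```

With these made-up numbers, memory is not the blocker on either instance type, but two containers each requesting a GPU cannot be placed on a single-GPU instance. That would be consistent with a task staying pending even after doubling the memory.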

> ECS_CONTAINER_METADATA_FILE is not set

My bad :( Fixed in 0859046, you can revert the hotfix in opencap-core.
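
For reference, a minimal sketch of what such a check can look like on the worker side. The function names are hypothetical and this is not the actual opencap-core code. It relies on the ECS agent exporting ECS_CONTAINER_METADATA_FILE into the container when container metadata is enabled in the agent configuration.

```python
import json
import os
from typing import Mapping, Optional


def read_ecs_metadata(environ: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return the parsed container metadata, or None when not running under ECS."""
    path = environ.get("ECS_CONTAINER_METADATA_FILE")
    if not path:
        # Variable missing: either not on ECS or metadata is not enabled.
        return None
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError):
        # File not written yet, unreadable, or only partially written.
        return None


def is_ecs_instance(environ: Mapping[str, str] = os.environ) -> bool:
    """True when the metadata variable is set and points to a readable JSON file."""
    return read_ecs_metadata(environ) is not None
```

Passing the environment as a parameter keeps the check easy to test locally without touching the real environment variables.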

> Two tasks start at once, shouldn't it be one by one?

Must be because of maximum_scaling_step_size = 2, fixed in 09b146e.

> When two tasks are active, they never stop even if the instances are unprotected.

I saw this, which suggests you tried to scale instances manually. I can't explain how, but my gut feeling suggests this might've caused three instances. Let's have another test please with the above fixes, especially the unprotect-on-crash things.

What else is extremely weird is that I don't see the AlarmLow ever getting triggered.

I'll look into this instance more, but we'll need another test definitely.

antoinefalisse and others added 6 commits June 5, 2024 16:33

support for mmpose on asg

9cb27ea

typo

7d0f3e5

ecs-on-ec2-auto-scaling-updates

1f2445c

fix: fix ECS_CONTAINER_METADATA_FILE is not set

0859046

re. !17 , see aws/amazon-ecs-agent#1514 (comment) for fix details

fix: set ECS_RESERVED_MEMORY to 768 to make placement more predictable

e802650

re. !17

fix: Scale by one instance, not 2

09b146e

re. !17

antoinefalisse and others added 17 commits June 7, 2024 07:00

fix conflicts

92eb747

fix conflicts

056f12a

Merge branch 'feature/ecs-on-ec2-auto-scaling' into feature/ecs-on-ec2-auto-scaling-updates

2b64aa5

bring back some changes

ffc0ff0

fix: Adjust memory reservation again

ad3b532

re. !17

Merge branch 'feature/ecs-on-ec2-auto-scaling' into feature/ecs-on-ec2-auto-scaling-updates

cdb9761

Merge branch 'feature/ecs-on-ec2-auto-scaling' into feature/ecs-on-ec2-auto-scaling-updates

bf7d8a2

minor

736be49

minor

59ed567

only one GPU

770b7cc

feat: Implement GPU sharing suggested by GH users

7c81950

We need to share a single GPU between openpose & mmpose re. !17, aws/containers-roadmap#327 (comment)

fix: Remove old launch configuration code

72c5492

wip

605c11e

fix: Update nvidia GPU sharing

6c2d55a

try a modern approach for configuring docker + set special env variable to make all GPUs visible re. !17

typo

ba11800

fix: Resort to having only mmpose with GPU available

0b3697b

re. !17

fix: Persist configuration that has mmpose and openpose working together

7f8ef96

re. !17
