Adding mmpose #17

Open. Wants to merge 23 commits into base: feature/ecs-on-ec2-auto-scaling
antoinefalisse (Collaborator) commented Jun 6, 2024

@sashasimkin @suhlrich @olehkorkh-planeks I got the ASG working, at least partly. Here are the remaining TODOs/bugs:

  • MMPOSE is not set in task definition

    • For the core algorithm to run, we need three containers: OpenCap, OpenPose, and MMPose. The existing task definition does not include MMPose, so I added it in this PR, basically copying what we do for OpenPose (FYI, OpenPose and MMPose are two containers that do essentially the same thing with different models). However, the tasks are not starting in the processing cluster: after more than 15 min, they are still pending/provisioning. I wondered whether it was a memory problem, because the mmpose image on ECR is 8931.13 MB and the three images combined are over 15 GB. I tried using a g5.2xlarge and doubling the memory, but I had the same issue (also, we already have a g5.xlarge on EC2 running the core algorithm without any memory problem, so I really don't think memory is the issue). Do you see something missing in my PR that could explain why the tasks do not start?
  • ECS_CONTAINER_METADATA_FILE is not set

    • The instance is not recognized as an ECS instance. I temporarily commented out most of the function and made it return True so I could keep testing, but with the original code the check fails, meaning ECS_CONTAINER_METADATA_FILE was not set. ChatGPT suggested making this change in the task_definition, but it did not help. Any thoughts on what is going wrong?
  • Two tasks start at once, shouldn't it be one by one?

  • When two tasks are active, they never stop even if the instances are unprotected.

    • Here are the logs from my tests. Tasks d6a and a3d were unprotected for 60 min while the number of pending trials was below the target, but the tasks never stopped. It is unclear why a third task started, since the max size should be 2. Interestingly, when three tasks were active, one (cda) actually stopped after being unprotected. My overall impression is that something is wrong with the max/min/desired number of tasks: 1) two tasks start at the same time, which is not expected; 2) the count goes above 2 although the max size is set to 2; 3) tasks are not stopped when two are running, but they are when three are running.
      • 13:22 - Task cda started
      • 13:22 - Task d6a started
      • 13:29 - Task cda: Instance unprotected
      • 13:30 - Task 21a started
        • Why? Max size should be 2 and no tasks stopped yet
        • It seems to have started right after the instance of task cda got unprotected.
      • 13:32 - Task d6a: Instance unprotected
      • 13:36 - Task 21a crashed
        • Status check failed (cp: cannot stat ‘/data/output_openpose/*’: No such file or directory)
        • The task seems to have stopped by itself, but this was not reported under events
        • That would be a nice feature if that happened
      • 13:37 - Task a3d started
        • Why? Max size should be 2.
        • It seems to have started right after the instance of task 21a crashed
      • 13:38 - Task a3d: Instance unprotected
      • 13:38 - Task cda stopped
      • 14:32 - Task d6a woke up, which is exactly 1h after it was “put to sleep”
      • 14:38 - Task a3d woke up, which is exactly 1h after it was “put to sleep”
  • Need to unprotect instance if something makes core crash (Antoine)

    • E.g., here
    • We also get this API error sometimes. Let's make sure we catch it and unprotect the instance. I am pretty sure that right now it just crashes and the trial remains in "processing" status.
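
The unprotect-on-crash TODO above could look roughly like the following. This is a minimal sketch, not the code in opencap-core: `run_with_scale_in_release`, `process_trial`, and the ASG name and instance ID arguments are hypothetical, and releasing protection after every trial is an assumption made for illustration. The only real API assumed is the boto3 Auto Scaling client's `set_instance_protection` call.

```python
from typing import Any, Callable


def run_with_scale_in_release(
    process_trial: Callable[[], Any],
    autoscaling_client: Any,
    asg_name: str,
    instance_id: str,
) -> Any:
    """Run one trial and always release scale-in protection afterwards.

    `process_trial` is a hypothetical callable standing in for the core
    processing step. `autoscaling_client` is expected to behave like
    boto3.client("autoscaling").
    """
    try:
        return process_trial()
    finally:
        # Runs on success and on any exception (e.g. an API error), so a
        # crash does not leave the instance protected from scale-in.
        autoscaling_client.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=False,
        )
```

The exception still propagates to the caller, so resetting the trial status from "processing" would have to be handled separately.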

sashasimkin (Collaborator) commented

Hi @antoinefalisse !

> However, the tasks are not starting in the processing cluster.

Is this still an issue? I see that later messages specify that the worker has actually started and processed stuff.
Were later messages from tests without MMPOSE?

In general, I have seen this behavior when the container instance didn't have enough memory to place the task.
Most likely, running an additional container also increases the runtime memory used by Docker itself.

In any case, I've improved it here, to make this behavior more predictable and adjustable: e802650
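
To illustrate the placement condition mentioned above: a task can only be placed when the memory requested by its containers fits into the memory the container instance still has available. The image size on ECR affects disk usage, not this check. The sketch below is only an illustration; `task_fits` and all numbers are hypothetical, and the real values would come from the task definition and from the remaining resources that the container instance reports.

```python
from typing import Iterable


def task_fits(remaining_memory_mib: int, container_memory_mib: Iterable[int]) -> bool:
    """Return True if the containers' combined memory fits on the instance.

    `remaining_memory_mib` stands for the memory the container instance
    still has available for placement; it is lower than the nominal RAM
    of the instance type.
    """
    return sum(container_memory_mib) <= remaining_memory_mib


# Hypothetical numbers for three containers (OpenCap, OpenPose, MMPose).
print(task_fits(15000, [4000, 4000, 4000]))  # fits
print(task_fits(15000, [8000, 4000, 4000]))  # does not fit
```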

> ECS_CONTAINER_METADATA_FILE is not set

My bad :( Fixed in 0859046, you can revert the hotfix in opencap-core.
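
For reference, a check of this kind can be sketched as below. This is an assumption about what the function in opencap-core does, not its actual code, and `is_running_on_ecs` is a hypothetical name. On the EC2 launch type, ECS sets ECS_CONTAINER_METADATA_FILE inside a container only when container metadata is enabled on the container agent (ECS_ENABLE_CONTAINER_METADATA=true).

```python
import os
from typing import Mapping, Optional


def is_running_on_ecs(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the ECS container metadata variable is set.

    `environ` defaults to the process environment; passing a mapping
    makes the check easy to test outside ECS.
    """
    env = os.environ if environ is None else environ
    # An unset or empty variable is treated as "not on ECS".
    return bool(env.get("ECS_CONTAINER_METADATA_FILE", ""))
```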

> Two tasks start at once, shouldn't it be one by one?

Must be because of maximum_scaling_step_size = 2, fixed in 09b146e.

> When two tasks are active, they never stop even if the instances are unprotected.

I saw this, which suggests you tried to scale instances manually. I can't explain how, but my gut feeling suggests this might've caused three instances. Let's have another test please with the above fixes, especially the unprotect-on-crash things.

[image]

What is also extremely weird is that I don't see the AlarmLow ever getting triggered.

I'll look into this instance more, but we'll definitely need another test.
