
[WIP] Scaling on AWS #117

Merged
merged 10 commits into from
May 15, 2024

Conversation

suhlrich
Member

This closes #113.

TODO AWS CLI:

  1. Get machine ID
  2. Send shutdown message

Turn off the backend machine after 5 minutes with no trial and nothing recording.
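For reference, the TODO above could be sketched with boto3 and the EC2 instance-metadata endpoint. This is only a sketch under assumed names: get_instance_id, stop_instance, is_idle, and the region are illustrative, not the repo's actual helpers.

```python
def get_instance_id():
    """Fetch this machine's EC2 instance ID from the instance-metadata
    endpoint (only reachable from inside EC2)."""
    import requests  # third-party; assumed available on the worker
    r = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2)
    return r.text

def stop_instance(instance_id, region_name="us-west-2"):
    """Send the shutdown (stop) request via boto3; the region is a
    placeholder, not the project's actual region."""
    import boto3  # AWS SDK; assumed installed on the worker
    ec2 = boto3.client("ec2", region_name=region_name)
    ec2.stop_instances(InstanceIds=[instance_id])

def is_idle(t_last_trial, now, recording, idle_minutes=5):
    """True when no trial has finished for `idle_minutes` and nothing
    is being recorded (times are epoch seconds)."""
    return (not recording) and (now - t_last_trial) > idle_minutes * 60
```

The shutdown path would then be gated on is_idle, so a machine mid-recording is never stopped.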

comments + addressing comment alex
@antoinefalisse antoinefalisse changed the title [WIP] Shut down AWS machine from app.py [WIP] Scaling on AWS May 13, 2024
@antoinefalisse antoinefalisse changed the base branch from main to dev May 13, 2024 17:03
@antoinefalisse
Collaborator

antoinefalisse commented May 13, 2024

@sashasimkin and @suhlrich could you take a look at this? I added the boto3 scripts and adjusted a couple of things.

  • I am not sure this is what we want here. My understanding: if the number of pending trials is under a certain threshold (e.g., the number of on-prem workers times the number of trials per worker), then we want to remove the scale-in protection. I think that is what this does; is that what you had in mind?
  • I don't think this will ever be triggered, except when we don't use on-prem workers (and we do want it to work in that case). The idea here is to remove the scale-in protection if the instance hasn't processed a trial for more than x minutes and there aren't trials being recorded.
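The first bullet's threshold check amounts to something like the following; the names here are hypothetical (the real values would live in the worker config), so treat this as a sketch of the intent rather than the repo's code.

```python
def should_remove_protection(pending_trials, n_on_prem_workers,
                             trials_per_worker):
    """Remove scale-in protection once the queue drops below the
    capacity the on-prem workers can absorb on their own."""
    threshold = n_on_prem_workers * trials_per_worker
    return pending_trials < threshold
```

With 2 on-prem workers and 2 trials per worker, a queue of 3 would unprotect the auto-scaling instance, while a queue of 4 would keep it protected.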

Member Author

@suhlrich suhlrich left a comment


looks good!

t_lastTrial = time.localtime()

if autoScalingInstance and justProcessed:
    justProcessed = False

Collaborator

Hi @sashasimkin, as I mentioned in the comment, I don't think that part should ever be triggered when we have on-prem workers, since there should be no case where that status is 404. However, I would like that part to work, and for us to test it, when there are no on-prem workers. The intended behavior is NOT for every single worker machine to process only 1 trial. The intention is to remove the scale-in protection after a certain time minutesBeforeRemoveScaleInProtection and only if there aren't trials being recorded. If there is a queue, the justProcessed variable gets set again and again to True and the scale-in protection is not removed. Does that make sense?
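The intended logic described above can be sketched as a single decision helper. The function name is hypothetical; only the variable names (justProcessed, minutesBeforeRemoveScaleInProtection) come from the comment.

```python
import time

def should_remove_scale_in_protection(t_last_trial, just_processed,
                                      recording, minutes_threshold,
                                      now=None):
    """True only once the instance has been idle for `minutes_threshold`
    minutes and nothing is recording. While a busy queue keeps setting
    just_processed back to True, this stays False, so the scale-in
    protection is kept (times are epoch seconds)."""
    now = time.time() if now is None else now
    if just_processed or recording:
        return False
    return (now - t_last_trial) > minutes_threshold * 60
```

Each processed trial would refresh t_last_trial and set just_processed back to True elsewhere in the loop, which is why a non-empty queue never unprotects the instance.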

Collaborator

I don't think I understand why we would need to move this there:

> only to move if r.status_code == 404: block before if with_on_prem:


Can you explain what would be the reason for r.status_code == 404? I assumed that this is returned when there's no work to do.

From what I see, it seems that setting "max_on_prem_pending_trials" to 0 when running without on-prem workers will do exactly what you want, if you lift the "with_on_prem" block and run the check in the main loop.

Re: "The intended behavior is NOT to process only 1 trial by every single worker machine." My bad, I didn't read the code fully, sorry.

Collaborator

> can you explain what would be the reason for r.status_code == 404? I assumed that this is returned when there's no work to do.

Yes, exactly. If there are no jobs, it returns 404 and the worker starts over until there is a job to process.
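That 404-means-no-work behavior amounts to a simple polling loop. Sketched here with a hypothetical fetch callable rather than the repo's actual request code:

```python
import time

def poll_for_job(fetch, sleep_s=1.0, max_attempts=10):
    """Keep asking the API for work. A 404 means 'no job yet': sleep
    and start over. `fetch` is any callable returning a
    (status_code, payload) pair."""
    for _ in range(max_attempts):
        status, payload = fetch()
        if status == 404:
            time.sleep(sleep_s)
            continue
        return payload
    return None  # gave up after max_attempts
```

In the real worker the loop would presumably run indefinitely; max_attempts is only here to keep the sketch finite.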

> From what I see, it seems that setting "max_on_prem_pending_trials" to 0 when running without on prem workers will do exactly what you want, if you lift the "with_on_prem" block and run the check in the main loop?

I'm not sure I follow this. Are you suggesting to move this somewhere else?


My suggestion is to not have it at all, and also to drop the "if with_on_prem:" condition for unprotecting, to simplify the code.

But I'm now not sure that I understand the whole logic, so I'll resolve this comment now and will try to explain the point on the call today.

@antoinefalisse antoinefalisse merged commit 44042a0 into dev May 15, 2024
Development

Successfully merging this pull request may close these issues.

Backend scaling -- stop instance
3 participants