[WIP] Scaling on AWS #117
Conversation
comments + addressing comment alex
@sashasimkin and @suhlrich could you take a look at this? I added the boto3 scripts and adjusted a couple of things.
looks good!
t_lastTrial = time.localtime()

if autoScalingInstance and justProcessed:
    justProcessed = False
Hi @sashasimkin, as I mentioned in the comment, I don't think that part should ever be triggered when we have on-prem workers, since there should be no case where that status is 404. However, I would like that part to work, and for us to test it, when there are no on-prem workers. The intended behavior is NOT for every worker machine to process only one trial. The intention is to remove the scale-in protection after a certain time (minutesBeforeRemoveScaleInProtection) if no trials are being recorded. If there is a queue, then the justProcessed variable gets set to True again and again, and the scale-in protection is not removed. Does that make sense?
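The timer logic described above can be sketched as follows. This is a simplified, hypothetical version, not the actual script: it uses epoch seconds instead of time.localtime(), and should_remove_protection is an illustrative name that only models the decision of when to lift scale-in protection.

```python
import time

def should_remove_protection(t_lastTrial, justProcessed,
                             minutesBeforeRemoveScaleInProtection=5,
                             now=None):
    """Return True when the worker has been idle long enough.

    t_lastTrial: epoch time of the last processed trial (simplification
    of the script's time.localtime() value).
    justProcessed: while a queue exists, each processed trial resets this
    to True, so scale-in protection is never removed.
    """
    now = now if now is not None else time.time()
    if justProcessed:
        return False
    idle_seconds = now - t_lastTrial
    return idle_seconds >= minutesBeforeRemoveScaleInProtection * 60
```

With a queue, justProcessed stays True and the function always returns False; only after the queue drains and the idle window elapses does it signal that protection can be removed.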
I don't think I understand why we would need to move this there:
only to move the if r.status_code == 404: block before if with_on_prem:
Can you explain what the reason for r.status_code == 404 would be? I assumed that this is returned when there's no work to do.
From what I see, setting "max_on_prem_pending_trials" to 0 when running without on-prem workers will do exactly what you want, if you lift the "with_on_prem" block and run the check in the main loop.
re. "The intended behavior is NOT to process only 1 trial by every single worker machine." - my bad, I didn't read the code fully, sorry.
can you explain what would be the reason for r.status_code == 404? I assumed that this is returned when there's no work to do.
Yes, exactly. If there are no jobs, it returns 404 and the worker starts over until there is a job to process.
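A minimal sketch of that polling behavior, assuming a fetch callable that stands in for the real HTTP request (the actual endpoint and response shape are not shown in this thread, so both are placeholders):

```python
import time

def wait_for_trial(fetch, poll_interval=0.0, max_polls=None):
    """Poll until a trial is available.

    `fetch` returns a (status_code, payload) tuple; a 404 status means
    there is currently no job, so the worker sleeps and retries.
    `max_polls` bounds the loop for testing; the real worker polls forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        status, payload = fetch()
        if status == 404:
            polls += 1
            time.sleep(poll_interval)
            continue
        return payload
    return None
```

Injecting `fetch` keeps the retry logic separate from the HTTP client, which is only a convenience for this sketch, not necessarily how the project structures it.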
From what I see, it seems that setting "max_on_prem_pending_trials" to 0 when running without on prem workers will do exactly what you want, if you lift the "with_on_prem" block and run the check in the main loop?
I'm not sure I follow this. Are you suggesting to move this somewhere else?
My suggestion is to not have it at all, and also to drop the "if with_on_prem:" condition for unprotecting, to simplify the code.
But I'm now not sure that I understand the whole logic, so I'll resolve this comment now and will try to explain the point on the call today.
This closes #113.
TODO (AWS CLI): turn off the backend machine after 5 minutes with no new trials and nothing recording.
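As a hedged sketch of how that TODO might look with boto3's Auto Scaling API: set_instance_protection is a real boto3 Auto Scaling client call, but the instance ID and group name below are placeholders, and the 5-minute idle check is assumed to happen elsewhere (e.g. via the justProcessed logic discussed above). The client is passed in so the call can be exercised without AWS credentials.

```python
def remove_scale_in_protection(client, instance_id, group_name):
    """Allow the Auto Scaling group to retire an idle worker.

    `client` is a boto3 "autoscaling" client (or a test double).
    Clearing ProtectedFromScaleIn lets the group terminate this
    instance on the next scale-in event.
    """
    return client.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=False,
    )
```

In production this would be called as `remove_scale_in_protection(boto3.client("autoscaling"), ...)` once the idle timer fires.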