Auto-scaling questions #21
@sashasimkin I am thinking about a case where, say, 5 tasks had started and then two got stopped, yet there are still a lot of pending trials. Since we want most processing handled with ASG, this value will be low and this condition will not be met. Therefore the instances corresponding to the stopped tasks are still protected and will not stop (maybe for a while) until there are very few trials in the queue and this condition is met.
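For context, a minimal sketch of the protection/un-protection step being discussed: a worker removes its instance's scale-in protection only once the queue can be drained by OnPrem workers. This is not the repo's actual code; the names `pending_trials`, `onprem_capacity`, and the ASG name are hypothetical, and `boto3` plus `requests` are assumed.

```python
import boto3
import requests

ASG_NAME = "opencap-processing-asg"  # hypothetical ASG name

def maybe_unprotect_instance(pending_trials: int, onprem_capacity: int) -> None:
    # Keep the instance protected while the queue exceeds OnPrem capacity.
    if pending_trials >= onprem_capacity:
        return

    # Instance ID from the EC2 instance metadata service (IMDSv1 form for brevity).
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text

    # Remove scale-in protection so Auto Scaling is allowed to terminate this instance.
    boto3.client("autoscaling").set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=False,
    )
```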
@sashasimkin
Hey @antoinefalisse, sorry for the delay here.
There are multiple options, with varying effort and ROI. The biggest problem here is that we're dealing with two layers - EC2 and ECS - each using its own scaling metrics.
- The simplest option I see is a scheduled scaling policy that sets minimum=1 at the start of the day (when trials might start coming in) and returns the minimum to 0 at the end of the day (see the sketch below).
- An additional option is EC2 Warm Pools, but that requires much more effort to implement (lifecycle hooks, pre-fetching the Docker image on the machine), with a likely bigger ROI.
- Another option we can try is a step scaling policy instead of target tracking, which is more responsive; however, this only affects the first leg of scaling - the ECS tasks - while EC2 instances will still need 3 datapoints within 3 minutes.
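A minimal sketch of the scheduled-scaling option, assuming boto3; the ASG name, schedule, and time zone are placeholders, and the ECS service would need a similar scheduled minimum on its own scaling target.

```python
import boto3

autoscaling = boto3.client("autoscaling")
ASG_NAME = "opencap-processing-asg"  # hypothetical ASG name

# Keep one warm instance from the start of the working day...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="warm-instance-morning",
    Recurrence="0 8 * * MON-FRI",    # cron schedule, placeholder hours
    TimeZone="America/Los_Angeles",  # placeholder time zone
    MinSize=1,
)

# ...and let the group scale back to zero in the evening.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="scale-to-zero-evening",
    Recurrence="0 20 * * MON-FRI",
    TimeZone="America/Los_Angeles",
    MinSize=0,
)
```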
Answers to specific questions:

No, these alarms are managed by AWS Auto Scaling and can't be edited.
This is pretty good actually - that's how long it took AWS to start an EC2 instance and get a running task on it.
I'm not sure this affects the actual job executions. Steady state is when running tasks == desired tasks and health checks are passing.
Back to the previous point - please confirm that it actually took >11m to start processing; please measure the time end-to-end, from the moment a trial is submitted to the moment the worker starts processing it (a rough sketch of that measurement is below). We can also try instances twice as big (in terms of the number of GPUs) and run two tasks per instance, which, in theory, should speed up a cold boot and dramatically improve the second boot.
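A rough sketch of that end-to-end measurement; the trial field names are hypothetical and would need to match whatever timestamps the backend already stores.

```python
from datetime import datetime

def e2e_wait_seconds(trial: dict) -> float:
    """Seconds from trial submission until a worker started processing it."""
    submitted = datetime.fromisoformat(trial["submitted_at"])          # hypothetical field
    started = datetime.fromisoformat(trial["processing_started_at"])   # hypothetical field
    return (started - submitted).total_seconds()
```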
AWS Auto Scaling prioritizes availability, so this is pretty much up to AWS and can be regarded as random :)
Same as above: since we rely on Auto Scaling repeatedly attempting to terminate the instance after un-protection, it's up to AWS and the timing between metric changes and instance un-protection.
In general they don't affect the scaling much, so I'll only reply to the relevant part.
It shouldn't, but we can decrease it anyway. This is basically the 300-second instance warmup logic (a sketch of lowering it is below).
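A minimal sketch of lowering that warmup, assuming boto3 and a hypothetical ASG name; depending on how the policy is defined, EstimatedInstanceWarmup on the scaling policy may be the relevant knob instead.

```python
import boto3

# Lower the default instance warmup from the 300 seconds discussed above.
boto3.client("autoscaling").update_auto_scaling_group(
    AutoScalingGroupName="opencap-processing-asg",  # hypothetical ASG name
    DefaultInstanceWarmup=120,                      # seconds
)
```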
We can't, because if we terminate the instance, AWS will spawn a new one. I'll need a bit more time to answer the rest of the questions.
Also, as an extra comment on scale-in speed: it could be more responsive if we used https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-scale-in-protection.html (sketched below), but that sounds like too little ROI.
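For reference, a minimal sketch of that task scale-in protection mechanism, assuming the worker runs inside the ECS task and can reach the agent endpoint: the task marks itself protected while processing a trial and releases the protection when idle.

```python
import os
import requests

# Endpoint injected by the ECS agent for task scale-in protection.
TASK_PROTECTION_URL = f"{os.environ['ECS_AGENT_URI']}/task-protection/v1/state"

def set_task_protection(protected: bool, expires_in_minutes: int = 60) -> None:
    body = {"ProtectionEnabled": protected}
    if protected:
        body["ExpiresInMinutes"] = expires_in_minutes  # protection auto-expires as a safety net
    requests.put(TASK_PROTECTION_URL, json=body, timeout=5).raise_for_status()

# Typical usage around a unit of work:
# set_task_protection(True)   # before picking up a trial
# ...process the trial...
# set_task_protection(False)  # once the queue is drained
```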
It will keep running and processing until pending_trials (how many there are in the queue) is less than OnPrem capacity. Sorry, I'm not following the other example - why were the tasks stopped if there are a lot of pending trials in the queue?
I thought the idea was that you wanted most processing to happen on OnPrem workers and have the ASG join in when there are more trials than OnPrem capacity?
This is to account for OnPrem workers, so that they keep processing a known amount of trials. Also, right now the logic is set to scale out [...] I see that the screenshots are from ModelHealth, and you mentioned at some point that you don't have OnPrem capacity for that project. In that case it definitely makes sense to have [...]

To summarize everything: [...]
Let me know your thoughts on the above. It would also be useful to understand your expectations and general bounds: what, in your opinion, is fast, what is already slow, and what is acceptable.
@sashasimkin thanks, this is very helpful. I suggest we first get this deployed and see how it works. Then we can iterate on what would make the most sense to optimize for speed.
The current implementation of auto-scaling is working well. I am wondering if there are ways to get things to go faster though, especially scale out (I am currently more concerned about people waiting for their data to be processed than paying extra because instances are waiting to be terminated). Here is a set of related questions.
Starting task
We need 3 datapoints within 3 minutes, which means it will take at least 3 minutes for the task to start. Is there a way to change that (e.g., 2 datapoints within 2 minutes)? (A sketch of what such an alarm could look like follows at the end of this section.)
As you can see in the screenshot below, there is another 3 minutes between "service worker has started 1 tasks: task 18d413914fce4bec9f6eb8fd3de3b007" and "(service worker, taskSet ecs-svc/9445821085410031600) has started 1 tasks: task 18d413914fce4bec9f6eb8fd3de3b007". Is there anything that could be optimized there to make things faster?
As you can see in the screenshot below, there is another 5 minutes for the service worker to reach a steady state. Is there anything that could be optimized there to make things faster?
Overall, it took >11 minutes for the instance to start processing trials after the number of pending trials exceeded the threshold. Is there anything that could be done there to make things faster? E.g., more memory, more CPUs, decreased buffers, etc.
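The AWS-managed target-tracking alarms can't be edited (as noted in the answers above), but for illustration, here is a sketch of what a hand-managed alarm with 2 datapoints within 2 minutes could look like if the step-scaling route were taken; the metric name, namespace, and policy ARN are placeholders, not the project's actual configuration.

```python
import boto3

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="pending-trials-scale-out",
    Namespace="OpenCap",                  # placeholder namespace
    MetricName="PendingTrials",           # placeholder custom metric
    Statistic="Average",
    Period=60,                            # one datapoint per minute
    EvaluationPeriods=2,
    DatapointsToAlarm=2,                  # 2 datapoints within 2 minutes
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:autoscaling:..."],  # placeholder: step scaling policy ARN
)
```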
Stopping task
Buffers
Are there any buffers that I missed that could explain some of the delays I reported above?
Others
Screenshots