
Auto-scaling questions #21

Open
antoinefalisse opened this issue Oct 1, 2024 · 5 comments


antoinefalisse commented Oct 1, 2024

The current implementation of auto-scaling is working well. I am wondering if there are ways to get things to go faster though, especially scaling out (I am currently more concerned about people waiting for their data to be processed than paying extra because instances are waiting to be terminated). Here is a set of related questions.

Starting task

  1. We need 3 datapoints within 3 minutes, which means it will take at least 3 minutes for the task to start. Is there a way to change that (e.g., 2 datapoints within 2 minutes)?

  2. As you can see in the screenshot below, there is another 3 minutes between "service worker has started 1 tasks: task 18d413914fce4bec9f6eb8fd3de3b007" and "service worker, taskSet ecs-svc/9445821085410031600) has started 1 tasks: task 18d413914fce4bec9f6eb8fd3de3b007". Is there anything that could be optimized there to make things faster?

  3. As you can see in the screenshot below, there is another 5 minutes for the service worker to reach a steady state. Is there anything that could be optimized there to make things faster?

Overall, it took >11 minutes for the instance to start processing trials after the number of pending trials exceeded the threshold. Is there anything that could be done there to make things faster? E.g., more memory, CPUs, decreased buffers, etc.

Stopping task

  1. The TargetTracking-service/modelhealth-processing-cluster-dev/worker-AlarmLow went into alarm at 19:48, but the event Message: Successfully set desired count to 0... only got triggered at 19:52. Why is there a delay (which seems variable based on my tests)?
  2. I am a bit confused about when the EC2 instance is actually terminated. Here it happened at 20:09 (see second screenshot - different timezone), which is 45 min after it started, 30 min after the scale-in protection was removed, and 17 min after the task stopped. I cannot really make sense of these numbers. Do you have any explanations?

Buffers

  1. Heartbeat timeout: The EC2 instance has a heartbeat timeout of 3600s. How does that play a role in scaling in?
  2. Health_check_grace_period: I found 300s for that but understood that it is just a period during which no health checks are performed. Could you confirm it should not create delays in scaling out/in?
  3. Scale_in_cooldown/Scale_out_cooldown: I found 30s for each but understood that it is just time between operations. I therefore don't think it should dramatically affect the numbers above since I was only playing with 1 instance.
  4. 300 seconds to warm up before including in metric: I found that one too but understood that it is just a period during which the ASG does not consider metrics coming from the EC2 instance (metrics which we don't use, I think), and therefore it should not matter for us.

Are there any buffers that I missed that could explain some of the delays I reported above?

Others

  1. When we remove the scale-in protection, the worker just sleeps on the instance, meaning we could also terminate it right away since it will no longer be needed. I think things will get clearer for me based on your answers above, but there seems to be a long time between removing the scale-in protection and terminating the instance (>30 min).

Screenshots

[screenshot 1]
[screenshot 2]


antoinefalisse commented Oct 1, 2024

@sashasimkin
Another question: If the task has been stopped, but the instance has not been unprotected yet (e.g., this condition is not met), will it then keep running (processing data) until it gets unprotected? Just wondering if we should access the number of tasks and divide pending_trials by the number of tasks here.

I am thinking about a case where, say, 5 tasks had started and then two got stopped, yet there are still a lot of pending trials. Since we want most processing handled with ASG, this value will be low and this condition will not be met. Therefore the instances corresponding to the tasks that stopped are still protected and will not stop (maybe for a while) until there are very few trials in the queue and this condition is met.

@antoinefalisse (Collaborator, Author)

@sashasimkin
Last question for now: It is not entirely clear to me why you subtract trials_baseline here. I feel it delays the start of the task (you need #target + #baseline trials for it to start) and stops it early, potentially leaving >#target trials in the queue. I'd be tempted to set it to 0. Any thoughts?

@sashasimkin (Collaborator)

Hey @antoinefalisse, sorry for the delay here.

I am wondering if there are ways to get things to go faster though, especially scaling out (I am currently more concerned about people waiting for their data to be processed than paying extra because instances are waiting to be terminated)

There are multiple options, with varying effort and ROI. The biggest problem here is that we're dealing with two layers - EC2 and ECS - each using its own scaling metrics.

The simplest one I see is a scheduled scaling policy that sets minimum=1 at the start of the day (when trials might start coming in) and returns the minimum back to 0 at the end of the day.
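
For illustration, a minimal sketch of what such a scheduled action could look like (shown with boto3 here, although in this project it would live in the Terraform config; the ASG name and the schedule windows are assumptions):

```python
# Sketch of scheduled scaling: keep one instance warm during working hours and
# let the ASG scale back to 0 overnight. The ASG name and cron windows below
# are hypothetical placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

# Morning: raise the minimum to 1 so an instance is already running when the
# first trials arrive.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="modelhealth-processing-asg-dev",  # placeholder name
    ScheduledActionName="scale-up-morning",
    Recurrence="0 7 * * MON-FRI",  # 07:00 UTC on weekdays (assumed window)
    MinSize=1,
)

# Evening: drop the minimum back to 0 so the usual scale-in can terminate it.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="modelhealth-processing-asg-dev",  # placeholder name
    ScheduledActionName="scale-down-evening",
    Recurrence="0 19 * * MON-FRI",  # 19:00 UTC on weekdays (assumed window)
    MinSize=0,
)
```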

An additional option is using EC2 Warm Pools, but that requires much more effort to implement (implementing lifecycle hooks, pre-fetching the Docker image on the machine), with a likely bigger ROI.
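
Attaching the warm pool itself is the small part; the lifecycle hooks and image pre-fetching are where the effort goes. A hedged sketch of just the warm-pool attachment (ASG name and sizes are assumptions):

```python
# Sketch: attach a warm pool of stopped, pre-initialized instances to the ASG.
# This alone does not pre-pull the Docker image; that requires lifecycle hooks
# and user-data work, which is the bulk of the effort mentioned above.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_warm_pool(
    AutoScalingGroupName="modelhealth-processing-asg-dev",  # placeholder name
    MinSize=1,                    # keep at least one instance pre-initialized
    PoolState="Stopped",          # stopped instances are cheaper than running
    MaxGroupPreparedCapacity=2,   # assumed cap
)
```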

Another option we can try is using a step scaling policy instead of target tracking, which is more responsive; however, this will only affect the first leg of scaling - the ECS tasks - while the EC2 instances will still need 3 datapoints within 3 minutes.
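
Roughly, the step-scaling variant would look like this (a boto3 sketch rather than the project's Terraform; the metric namespace, thresholds, and exact names are assumptions):

```python
# Sketch: a step scaling policy on the ECS service, driven by a single-datapoint
# alarm on the pending-trials metric. Namespace and thresholds are hypothetical.
import boto3

aas = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

policy = aas.put_scaling_policy(
    PolicyName="worker-step-scale-out",
    ServiceNamespace="ecs",
    ResourceId="service/modelhealth-processing-cluster-dev/worker",  # from the alarm name above
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 30,
        "MetricAggregationType": "Average",
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1},
        ],
    },
)

# The alarm evaluates a single 1-minute datapoint, which is what makes this
# react faster than the managed target-tracking alarms.
cloudwatch.put_metric_alarm(
    AlarmName="worker-pending-trials-high",
    Namespace="Custom/Processing",   # assumed custom namespace
    MetricName="pending_trials",     # assumed metric name
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```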

Answers to specific questions

We need 3 datapoints within 3 minutes, which means it will take at least 3 minutes for the task to start. Is there a way to change that (e.g., 2 datapoints within 2 minutes)?

No, these alarms are managed by AWS Auto Scaling and can't be edited.

As you can see in the screenshot below, there is another 3 minutes between "service worker has started 1 tasks: task 18d413914fce4bec9f6eb8fd3de3b007" and "service worker, taskSet ecs-svc/9445821085410031600) has started 1 tasks: task 18d413914fce4bec9f6eb8fd3de3b007". Is there anything that could be optimized there to make things faster?

This is pretty good actually - that is how long it took AWS to start an EC2 instance and get a running task on it.
The optimization possibilities are the ones I listed above: having instances available (scheduled scaling) or ready (warm pools). Also, I remember that the images you're using are pretty heavy, so slimming them down, or creating custom AMIs for the EC2 instances with these images already present, will improve the speed too.

As you can see in the screenshot below, there is another 5 minutes for the service worker to reach a steady state. Is there anything that could be optimized there to make things faster?

I'm not sure if this is affecting the actual job executions. Steady state is when the running tasks == desired tasks and health checks are passing.
I don't see any health check implemented in the task_definition.json.tpl, which might make ECS fall back to whatever task/instance warmup value it has set as a default. We can explore this avenue once you confirm the timings.
Can you please confirm this by checking the logs of the worker itself, to see when it boots up and the timestamp when it takes its first job for execution?
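
For reference, an explicit container health check is a small addition to the container definition. A sketch below, shown via boto3 instead of the project's task_definition.json.tpl; the probe command, memory value, and timings are assumptions:

```python
# Sketch: register a task definition whose container carries an explicit
# healthCheck, so ECS does not rely on default warmup behaviour to decide
# when the service has reached steady state. All values are illustrative.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="worker",  # placeholder family name
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "worker",
            "image": "<existing worker image>",  # unchanged project image
            "memoryReservation": 2048,           # assumed reservation
            "essential": True,
            "healthCheck": {
                # Hypothetical probe; the worker would need some cheap signal,
                # e.g. a file touched by its main loop or a local endpoint.
                "command": ["CMD-SHELL", "test -f /tmp/worker_alive || exit 1"],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 120,
            },
        }
    ],
)
```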

Overall, it took >11 minutes for the instance to start processing trials after the number of pending trials exceeded the threshold. Is there anything that could be done there to make things faster? E.g., more memory, CPUs, decreased buffers, etc.

Coming back to the previous point - please confirm that it actually took >11 minutes to start processing, and measure the time end-to-end from the moment the trial is submitted to the moment the worker starts processing it.

We can also try twice-bigger instances (in terms of the number of GPUs) and run two tasks per instance, which, in theory, should speed up a cold boot and dramatically improve the second one.


The TargetTracking-service/modelhealth-processing-cluster-dev/worker-AlarmLow went into alarm at 19:48, but the event Message: Successfully set desired count to 0... only got triggered at 19:52. Why is there a delay (which seems variable based on my tests)?

AWS Auto Scaling prioritizes availability, so this is pretty much up to AWS and can be regarded as random :)

I am a bit confused about when the EC2 instance is actually terminated. Here it happened at 20:09 (see second screenshot - different timezone), which is 45 min after it started, 30 min after the scale-in protection was removed, and 17 min after the task stopped. I cannot really make sense of these numbers. Do you have any explanations?

Same as above: since we rely on Auto Scaling repeatedly attempting to terminate the instance after un-protection, the timing is up to AWS and to how the metric changes happen to line up with the instance un-protection.


Buffers

In general they don't affect the scaling much, so I'll reply only to the relevant part.

Health_check_grace_period: I found 300s for that but understood that it is just a period during which no health checks are performed. Could you confirm it should not create delays in scaling out/in?

It shouldn't, but we can decrease it anyway. This is basically the 300-second instance warmup logic.


When we remove the scale-in protection, the worker just sleeps on the instance, meaning we could also terminate it right away since it will no longer be needed. I think things will get clearer for me based on your answers above, but there seems to be a long time between removing the scale-in protection and terminating the instance (>30 min).

We can't, because if we terminate the instance ourselves while the desired capacity is unchanged, AWS will just spawn a new one to replace it.


I'll need a bit more time to answer the rest of the questions.

@sashasimkin (Collaborator)

Also, as an extra comment on scale-in speed: it could be more responsive if we used https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-scale-in-protection.html, but that sounds like too little ROI.
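
For completeness, the gist of that approach is that the worker protects its own ECS task while it is busy, instead of protecting the whole EC2 instance. A hedged sketch, assuming the worker runs as an ECS task with the standard task metadata endpoint available (helper names are hypothetical):

```python
# Sketch: ECS task scale-in protection from inside the worker. The task asks
# the metadata endpoint who it is, then toggles protection around busy periods.
import json
import os
import urllib.request

import boto3

ecs = boto3.client("ecs")


def _task_identity():
    """Read cluster and task ARN from the ECS task metadata endpoint (v4)."""
    url = os.environ["ECS_CONTAINER_METADATA_URI_V4"] + "/task"
    with urllib.request.urlopen(url) as resp:
        meta = json.load(resp)
    return meta["Cluster"], meta["TaskARN"]


def set_task_protection(enabled: bool, minutes: int = 60) -> None:
    """Protect (or unprotect) this task from ECS scale-in."""
    cluster, task_arn = _task_identity()
    kwargs = {"cluster": cluster, "tasks": [task_arn], "protectionEnabled": enabled}
    if enabled:
        kwargs["expiresInMinutes"] = minutes  # auto-expire as a safety net
    ecs.update_task_protection(**kwargs)


# Usage idea: set_task_protection(True) before pulling a trial from the queue,
# set_task_protection(False) once the queue is drained.
```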


Another question: If the task has been stopped, but the instance has not been unprotected yet (e.g., this condition is not met), will it then keep running (processing data) until it gets unprotected? Just wondering if we should access the number of tasks and divide pending_trials by the number of tasks here.

It will keep running and processing until pending_trials (how many trials are in the queue) is less than the on-prem capacity.

Sorry, I'm not following the other example - e.g., why were the tasks stopped if there are a lot of pending trials in the queue?

Since we want most processing handled with ASG

I thought the idea was that you wanted most processing to happen on the OnPrem workers and have the ASG join in when there are more trials than the OnPrem capacity?


Last question for now: It is not entirely clear to me why you subtract trials_baseline here. I feel it delays the start of the task (you need #target + #baseline trials for it to start) and stops it early, potentially leaving >#target trials in the queue. I'd be tempted to set it to 0. Any thoughts?

This is to account for the OnPrem workers, so that they keep processing a known number of trials. Also, right now the logic is set to scale out one instance per var.processing_asg_scaling_target trials in the queue, minus var.processing_asg_trials_baseline.
Which means, by default, that each instance is expected to be processing 5 trials. You can set var.processing_asg_scaling_target to 1 and var.processing_asg_trials_baseline to 0 to process 1 trial per instance, within the auto-scaling group only.
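
To make the arithmetic concrete, a rough sketch of that relationship; it assumes the tracked value is roughly pending_trials - trials_baseline and that capacity is sized by dividing by the target, which is a simplification of the actual scaling behaviour:

```python
import math


def expected_instances(pending_trials: int, scaling_target: int = 5,
                       trials_baseline: int = 0) -> int:
    """Rough estimate of desired capacity under this simplification:
    one instance per `scaling_target` trials above the OnPrem baseline."""
    excess = max(0, pending_trials - trials_baseline)
    return math.ceil(excess / scaling_target)


# With target=5, baseline=0: 12 pending trials -> 3 instances.
# With target=1, baseline=0: every pending trial gets its own instance.
print(expected_instances(12))                    # 3
print(expected_instances(12, scaling_target=1))  # 12
```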

I see that the screenshots are from ModelHealth, and you mentioned at some point that you don't have OnPrem capacity for that project. In that case it definitely makes sense to set trials_baseline to 0, both in Terraform and in the code (I suggest making it an environment variable in the code, so that the same image can be used by both projects without a rebuild).


To summarize everything:

  1. I'd advise you to redo the tests with the variable changes to make them more responsive, and measure the time from trial submission to the start of processing.
  2. Then, depending on the test results, the lowest-effort change would be to switch to step scaling to increase AWS's responsiveness.
  3. In general, I think the biggest ROI will come from utilizing EC2 Warm Pools, but this is likely going to be the most work too - it will require updates in the application, and I don't have much experience with Warm Pools, so there will be some trial and error.
  4. The cleanest solution would be to switch to Task protection instead of instance protection (cleaner + more responsive scale-in) and use step scaling too, but this is lower ROI.

Let me know your thoughts on the above. It would also be useful to understand your expectations and general bounds: what, in your opinion, is fast, what is already slow, and what is acceptable.

@antoinefalisse (Collaborator, Author)

@sashasimkin thanks, this is very helpful. I suggest we first get this deployed and see how that works. Then we can iterate on what would make the most sense to optimize for speed.
