Auto Scaling Central Issue #174
Comments
Hi @suhlrich, I have a few small comments to the logic:
There are no variables in CloudWatch per se; all you need to do is call put_metric_data (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch/client/put_metric_data.html) from the celery task.
I advise that instead of having the
I advise that we use ECS on EC2 to simplify running the image that you are pushing to ECR. I saw some code related to this in the infra repo, but it needs checking and polishing to make it work, both in general and with auto-scaling.
This will be just https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-autoscaling-targettracking.html of the agreed value of |
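For illustration, publishing the value from the celery task could look roughly like the sketch below. It is not the project's code: the namespace `OpenCap/AutoScaling` and the function names are hypothetical, and the metric name is borrowed from the issue description. The only AWS call used is boto3's CloudWatch `put_metric_data`.

```python
def build_metric_payload(desired_instances,
                         namespace="OpenCap/AutoScaling",  # hypothetical namespace
                         metric_name="desired_asg_gpu_instances"):
    """Build the keyword arguments for CloudWatch put_metric_data."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Value": float(desired_instances),
                "Unit": "Count",
            }
        ],
    }


def publish_desired_instances(desired_instances, client=None):
    """Publish one data point; intended to be called from the periodic celery task."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3
        client = boto3.client("cloudwatch")
    payload = build_metric_payload(desired_instances)
    client.put_metric_data(**payload)
    return payload
```

Passing the client in as a parameter keeps the task easy to test without AWS credentials; in production the default boto3 client would be used.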
@sashasimkin: Is it not possible to use an n_desired_asg_instances variable for the ASG target? This way, we can implement whatever logic we like here (#173) that is accessible by the ASG and within the GPU servers so they can know when to shut down: |
@suhlrich it is possible to use In general, the application doesn't manage the number of instances to process the job, but this logic is implemented in the infrastructure layer based on various factors. I've replied here about termination logic. |
@sashasimkin So we can implement similar logic to here: #173 in the infrastructure level? |
@suhlrich yes - exactly, and the logic will be simpler. I.e. instead of calculating the number of instances and tracking the numbers before/after scaling, we will have simpler target tracking that periodically checks if |
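As a rough sketch of what such a target tracking policy might look like when created through boto3 rather than the infra repo: the metric that the policy tracks is cut off in the comment above, so the metric name, namespace, cooldowns, and cluster/service names here are placeholders, not the agreed values. The call used is Application Auto Scaling's `put_scaling_policy` with a `CustomizedMetricSpecification`.

```python
def build_target_tracking_policy(cluster, service, metric_name, namespace, target_value):
    """Build put_scaling_policy arguments for an ECS service (all names are placeholders)."""
    return {
        "PolicyName": f"{service}-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scaling adds or removes tasks to keep the metric near this value.
            "TargetValue": float(target_value),
            "CustomizedMetricSpecification": {
                "MetricName": metric_name,
                "Namespace": namespace,
                "Statistic": "Average",
            },
            "ScaleOutCooldown": 60,   # seconds, assumed value
            "ScaleInCooldown": 300,   # seconds, assumed value
        },
    }


def apply_policy(policy, client=None):
    """Create or update the policy; the scalable target must already be registered."""
    if client is None:
        import boto3
        client = boto3.client("application-autoscaling")
    return client.put_scaling_policy(**policy)
```

The ECS service has to be registered as a scalable target (`register_scalable_target`) before the policy can be attached, and in practice this configuration would more likely live in the infrastructure code than in the application.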
Enable usage of Launch Template because AWS has deprecated Launch Configuration
Add missing capacity provider settings
Add stub resources for auto-scaling configuration
re. #14, stanfordnmbl/opencap-api#174
…essing Even though the code is there, it's not functional yet because neither dev nor prod has autoscaling enabled. re. #14, stanfordnmbl/opencap-api#174
@sashasimkin let's use g5.2xlarge instances. |
We'd like to have surge GPU capacity using AWS auto-scaling. We will have base capacity that is always running, so this will only activate if the queue is a certain length.
… desired_asg_gpu_instances that will get updated by the celery queue check and checked by the auto-scaling rule. @sashasimkin
… desired_asg_gpu_instances on cloudwatch: Scale up logic #173 @olehkorkh-planeks
… desired_asg_gpu_instances from cloudwatch and spins up/down machines. Spun up machines should have scale-in protection.
… desired_asg_gpu_instances. See discussion here: Backend scaling -- stop instance opencap-core#113 @olehkorkh-planeks
@olehkorkh-planeks @sashasimkin @antoinefalisse please read over and update this.
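For reference, scale-in protection on a spun-up machine can be toggled with the EC2 Auto Scaling API. The sketch below is only an assumption about how a GPU worker might do this; the function name and the instance and group identifiers are hypothetical. The boto3 call is `set_instance_protection` on the `autoscaling` client.

```python
def set_scale_in_protection(instance_id, asg_name, protected, client=None):
    """Enable or disable scale-in protection for one instance in an ASG."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3
        client = boto3.client("autoscaling")
    request = {
        "InstanceIds": [instance_id],
        "AutoScalingGroupName": asg_name,
        "ProtectedFromScaleIn": bool(protected),
    }
    client.set_instance_protection(**request)
    return request
```

A worker could call this with `protected=True` when it starts and with `protected=False` once it is ready to be terminated, so that a scale-in event does not interrupt it.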