Backend scaling -- stop instance #113
Comments
hey @suhlrich, I understand the base capacity problem, and I was wrong regarding the inactivity logic.
This approach leaves the auto-scaling logic to the policies configured in AWS and separates concerns between what the application does and what the infra control plane does.
Sounds good. Sounds like we should create an API endpoint that returns number_of_jobs (unless there's an easy way to get it from cloudwatch, but an HTTP request sounds easy to me).
Yes, the endpoint would work. Its biggest value is that it would provide real-time and the most precise data possible (as this is the source of truth). But cloudwatch might be easier to implement (it just needs AWS permissions added to the ECS role) and a better fit: we can get an aggregated value over the last x min with
How would we query cloudwatch? Would this be another batch script? I guess it is probably better to be looking at the same value that the ASG rule is looking at.
hey @suhlrich, so here's the code that will read the metric submitted by stanfordnmbl/opencap-api#173 (comment): This code requires the
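For reference, a minimal sketch of reading a CloudWatch metric with boto3 (this is not the code originally posted in this comment). It uses the `get_metric_statistics` call of the boto3 CloudWatch client, which needs the `cloudwatch:GetMetricStatistics` permission. The namespace, the 5-minute window, and the choice of the `Maximum` statistic are assumptions for illustration; the metric name is taken from the issue description. The client is passed in so the function can be exercised without AWS access.

```python
from datetime import datetime, timedelta, timezone


def get_recent_metric_max(cloudwatch, namespace, metric_name, minutes=5, period=60):
    """Return the maximum of a metric over the last `minutes`, or None if no data.

    `cloudwatch` is expected to be a boto3 CloudWatch client, e.g.
    boto3.client("cloudwatch"). Using Maximum is an assumption: it is the
    conservative choice when deciding whether an instance may go away.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        StartTime=start,
        EndTime=end,
        Period=period,  # seconds per datapoint
        Statistics=["Maximum"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None  # the metric has not been published in this window
    return max(point["Maximum"] for point in datapoints)


# Hypothetical usage (the namespace is made up):
#   import boto3
#   cw = boto3.client("cloudwatch")
#   desired = get_recent_metric_max(cw, "Custom/opencap", "desired_asg_gpu_instances")
```

Returning None when there are no datapoints keeps "no data" distinct from "the metric is 0", so a caller does not pause a machine just because the metric is missing.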
@suhlrich here's a bash script (and boto3 code, which should be a better fit) for un-protecting the instance:
Similar approach, but as boto3 to be called from Python (as in #117), because you will need boto3 for the cloudwatch interactions anyway:
Either version will require the following permissions for the processing worker process. But it's just for reference, as I'll be adding them in terraform.
P.S.: Regarding your original point 2), you don't need to configure any dedicated AWS credentials for this, as the permissions will be inferred from the environment through the ECS Task role.
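A minimal sketch of what un-protecting an instance with boto3 could look like (not the code originally posted above). It uses the `set_instance_protection` and `describe_auto_scaling_instances` calls of the boto3 Auto Scaling client, which correspond to the IAM actions `autoscaling:SetInstanceProtection` and `autoscaling:DescribeAutoScalingInstances`. How the worker learns its own instance ID is left out and assumed to be handled elsewhere; the function names are hypothetical.

```python
def find_asg_name(autoscaling, instance_id):
    """Look up which Auto Scaling group an instance belongs to, or None."""
    response = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    instances = response.get("AutoScalingInstances", [])
    if not instances:
        return None  # the instance is not managed by any ASG
    return instances[0]["AutoScalingGroupName"]


def remove_scale_in_protection(autoscaling, instance_id, asg_name=None):
    """Clear scale-in protection so the ASG is allowed to terminate the instance.

    `autoscaling` is expected to be boto3.client("autoscaling"). Returns True
    if the protection flag was cleared, False if no ASG was found.
    """
    if asg_name is None:
        asg_name = find_asg_name(autoscaling, instance_id)
    if asg_name is None:
        return False
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=False,
    )
    return True
```

With an ECS Task role in place, `boto3.client("autoscaling")` picks up credentials from the environment, so no keys need to be passed explicitly.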
We need to make sure the backend machines aren't processing a job when they stop. So when a machine has not received a job for a certain amount of time, it should exit the app.py loop, remove its scale-in protection, and be ready for autoscaling to turn it off. There needs to be an env variable distinguishing always-on machines from ASG machines.
asg_machine. If True, the machine will pause after inactivity. @suhlrich / @antoinefalisse
Read desired_asg_gpu_instances from cloudwatch and, if it is 0, pause and remove scale-in protection. Should this be an API endpoint so we can keep the IAM permissions minimal for these machines? @suhlrich / @antoinefalisse / @olehkorkh-planeks
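The inactivity behaviour described above can be sketched roughly as follows. This is not the actual app.py loop: `fetch_job`, `process_job`, and `on_idle` are hypothetical callables standing in for the real job polling, processing, and un-protect steps, and the timeout values and the way the `asg_machine` variable is parsed are assumptions.

```python
import os
import time


def is_asg_machine(env=os.environ):
    """Read the asg_machine env variable; unset or non-truthy means always-on."""
    return env.get("asg_machine", "false").strip().lower() in ("1", "true", "yes")


def run_worker(fetch_job, process_job, asg_machine, idle_timeout_s, on_idle,
               poll_interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll for jobs; on an ASG machine, leave the loop after a period of inactivity.

    The idle check only runs between jobs, so the loop never exits while a
    job is being processed.
    """
    last_job_at = clock()
    while True:
        job = fetch_job()
        if job is not None:
            process_job(job)
            last_job_at = clock()  # inactivity is measured from the end of the job
            continue
        if asg_machine and clock() - last_job_at >= idle_timeout_s:
            on_idle()  # e.g. remove scale-in protection
            return
        sleep(poll_interval_s)
```

An always-on machine (`asg_machine=False`) never leaves the loop. If the decision should also depend on desired_asg_gpu_instances being 0, that check could be made inside `on_idle` or before calling it.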