Backend scaling -- stop instance #113

suhlrich · 2023-09-25T17:14:32Z

We need to make sure the backend machines aren't processing a job when they stop, so when a machine has not received a job for a certain amount of time, it should exit the app.py loop, remove its scale-in protection, and be ready for autoscaling to turn it off. There needs to be an env variable distinguishing always-on machines from asg machines.

1) env file: create env var asg_machine. If True, the machine will pause after inactivity. @suhlrich / @antoinefalisse
- @sashasimkin suggests to use AWS command instead: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-metadata.html
2) Add aws keys to env to use AWS CLI @suhlrich or @antoinefalisse
3) bash script for identifying machine name and removing scale-in protection @sashasimkin
4) check desired_asg_gpu_instances from cloudwatch and if it is 0, pause and remove scale-in protection. Should this be an API endpoint so we can keep the IAM permissions minimal for these machines? @suhlrich / @antoinefalisse / @olehkorkh-planeks
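
A minimal sketch of how item 1 could be read on the worker, assuming the variable is exposed as ASG_MACHINE and carries a string such as "True" or "false". The variable name, the accepted spellings, and the helper name are illustrative assumptions, not something already agreed in this issue.

```python
import os


def is_asg_machine(environ=os.environ):
    """Return True when the (assumed) ASG_MACHINE variable holds a truthy string.

    Environment variables are always strings, so "False" must be parsed
    explicitly; bool("False") would be True.
    """
    value = environ.get("ASG_MACHINE", "false")
    return value.strip().lower() in {"1", "true", "yes"}
```

Always-on machines would simply leave the variable unset (or set it to "false") and never enter the pause path.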


sashasimkin · 2024-04-23T21:56:16Z

I'll provide the bash script to disable the protection later, but the important part of it would be to configure IAM role & instance profile to have permissions to execute that command.
I believe there's no need to have access to any metric and the logic should be if no jobs for X minutes --> pause work
It's important to sleep in the cycle and not stop the process, because upon stopping ECS would restart the container which might have adverse effects of resetting counters.
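
The "sleep in the cycle, don't stop the process" idea could look roughly like the sketch below. All names (worker_loop, get_job, process_job, on_idle) and the default timings are hypothetical; the clock, sleep, and max_polls parameters exist only so the loop can be exercised without waiting.

```python
import time


def worker_loop(get_job, process_job, on_idle, idle_limit_s=600.0,
                poll_s=5.0, clock=time.monotonic, sleep=time.sleep,
                max_polls=None):
    """Poll for jobs; after idle_limit_s without one, call on_idle once and
    keep sleeping instead of exiting, so ECS does not restart the container."""
    last_job_at = clock()
    paused = False
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if not paused:
            job = get_job()
            if job is not None:
                process_job(job)
                last_job_at = clock()  # reset the inactivity timer
                continue
            if clock() - last_job_at >= idle_limit_s:
                on_idle()  # e.g. remove scale-in protection
                paused = True
        # Paused or no job available: sleep, never return in production use.
        sleep(poll_s)
    return paused
```

Once paused, the loop stops asking for jobs, so an unprotected instance cannot be terminated in the middle of a trial.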

suhlrich · 2024-04-23T22:16:57Z

The problem lies with our base capacity. We want to turn off the EC2 instances once the number of jobs falls below some threshold that we think is manageable on our on-prem server. If we turn off EC2 instances only when they don't get a job, then we only turn them off when the queue is empty, which would result in them remaining on longer than necessary if they happen to pick up a job while the on-prem servers are not busy. If we query the number of desired machines, then we can have more complex logic regarding when to turn off.

sashasimkin · 2024-04-24T10:54:58Z

hey @suhlrich, I understand the base capacity problem and I was wrong regarding inactivity logic.
At the same time, what we can do is change logic to be something like:

if number_of_jobs < settings.ONPREM_CAPACITY:
    unprotect_instance()
    pause_work()

this approach leaves the auto-scaling logic to the configured policies in AWS and separates concerns between what the application does and what the infra control plane is doing.
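
The pseudocode above can be made runnable along these lines. This is a sketch under assumptions: scale_in_step is a hypothetical helper, the two actions are passed in as callables so the decision can be tested without AWS, and treating a missing job count as "do nothing and retry later" is one possible choice rather than an agreed behaviour.

```python
def scale_in_step(number_of_jobs, onprem_capacity, unprotect_instance, pause_work):
    """Release this instance when the backlog fits on the on-prem servers.

    Returns True if the instance was unprotected and paused, False otherwise.
    """
    if number_of_jobs is None:
        # No data available: leave the instance protected and retry later.
        return False
    if number_of_jobs < onprem_capacity:
        unprotect_instance()
        pause_work()
        return True
    return False
```

The application only decides whether it is safe to be terminated; whether and when the instance is actually removed stays with the scaling policies configured in AWS.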

suhlrich · 2024-04-24T21:36:10Z

Sounds good. Sounds like we should create an API endpoint that returns number_of_jobs. (unless there's an easy way to get it from cloudwatch, but an http request sounds easy to me)
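
For illustration, the worker side of such an HTTP request might look like the sketch below. No endpoint exists yet, so the URL passed in, the JSON shape ({"number_of_jobs": <int>}), and both function names are assumptions; only the standard library is used.

```python
import json
import urllib.request


def parse_number_of_jobs(body):
    """Extract the job count from an assumed {"number_of_jobs": <int>} payload."""
    data = json.loads(body)
    return int(data["number_of_jobs"])


def fetch_number_of_jobs(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """GET the (hypothetical) endpoint and return the number of pending jobs.

    The opener parameter is injectable so the function can be tested offline.
    """
    with opener(url, timeout=timeout_s) as response:
        return parse_number_of_jobs(response.read())
```

A real endpoint would most likely also need authentication, which is left out here.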

sashasimkin · 2024-04-25T11:59:13Z

Yes, the endpoint would work. Its biggest value is that it would provide real-time and the most precise data possible (as it is the source of truth).

But CloudWatch might be easier to implement (it only requires adding AWS permissions to the ECS role) and a better fit: we can get a value aggregated over the last x min with get_metric_statistics/get_metric_data.

suhlrich · 2024-04-26T19:54:45Z

How would we query CloudWatch? Would this be another batch script? I guess it is probably better to be looking at the same value that the ASG rule is looking at.

sashasimkin · 2024-05-03T13:47:20Z

hey @suhlrich , so here's the code that will read the metric submitted by stanfordnmbl/opencap-api#173 (comment) :

This code requires the cloudwatch:GetMetricStatistics permission; as with the permissions above, I'll add it to the worker ECS service permissions.

import boto3
from datetime import datetime, timedelta

def get_metric_average(namespace, metric_name, start_time, end_time, period):
    """
    Fetch the average value of a specific metric from AWS CloudWatch.

    Parameters:
    - namespace (str): The namespace for the metric data.
    - metric_name (str): The name of the metric.
    - start_time (datetime): Start time for the data retrieval.
    - end_time (datetime): End time for the data retrieval.
    - period (int): The granularity, in seconds, of the data points returned.
    """
    client = boto3.client('cloudwatch')
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        StartTime=start_time,
        EndTime=end_time,
        Period=period,
        Statistics=['Average'],  # Request only the Average statistic
    )
    return response

def get_number_of_pending_trials():
    # Time range setup for the last 1 minute
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(minutes=1)
    period = 60  # Period in seconds
    metric_name = 'opencap_trials_pending'

    # Fetch the metric data
    stats = get_metric_average(
        'Custom/opencap',  # or 'Custom/opencap' for production
        metric_name,
        start_time, end_time, period
    )

    if not stats['Datapoints']:
        print("No data points found.")
        # Maybe raise an exception or do nothing to have control-loop retry this call later
        return None

    # Datapoints are not guaranteed to be ordered, so pick the most recent one
    latest = max(stats['Datapoints'], key=lambda point: point['Timestamp'])
    average = latest['Average']
    print(f"Average value of '{metric_name}' over the last minute: {average}")
    return average

sashasimkin · 2024-05-03T14:02:50Z

@suhlrich here's a bash script (and boto3 code, which should be a better fit) for un-protecting the instance:

#!/bin/bash
# Retrieve the Instance ID from the EC2 instance metadata service
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# Retrieve the Auto Scaling Group name associated with this instance
ASG_NAME=$(aws autoscaling describe-auto-scaling-instances --instance-ids "$INSTANCE_ID" --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text)
# Remove protection from this instance in its first (and only) autoscaling group
aws autoscaling set-instance-protection --instance-ids "$INSTANCE_ID" --auto-scaling-group-name "$ASG_NAME" --no-protected-from-scale-in

A similar approach using boto3, to be called from Python (as in #117), because you will need boto3 for CloudWatch interactions anyway:

import boto3
import requests

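def get_instance_id():
    """Retrieve the instance ID from EC2 metadata."""
    # A timeout keeps the worker from hanging if the metadata service is unreachable
    response = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    )
    response.raise_for_status()
    return response.text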

def get_auto_scaling_group_name(instance_id):
    """Retrieve the Auto Scaling Group name using the instance ID."""
    client = boto3.client('autoscaling')
    response = client.describe_auto_scaling_instances(InstanceIds=[instance_id])
    asg_name = response['AutoScalingInstances'][0]['AutoScalingGroupName']
    return asg_name

def set_instance_protection(instance_id, asg_name, protect):
    """Set or remove instance protection."""
    client = boto3.client('autoscaling')
    client.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protect
    )

def unprotect_current_instance():
    instance_id = get_instance_id()
    asg_name = get_auto_scaling_group_name(instance_id)
    set_instance_protection(instance_id, asg_name, protect=False)

Either version will require the following permissions for the worker process. This is just for reference, as I'll be adding them in Terraform.

            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:SetInstanceProtection"
            ],

P.S. Regarding your original point 2): you don't need to configure any dedicated AWS credentials for this, as the permissions will be inferred from the environment through the ECS task role.

suhlrich mentioned this issue Sep 25, 2023

Scaling backend machines stanfordnmbl/opencap-api#109

Closed

6 tasks

suhlrich mentioned this issue Oct 11, 2023

[WIP] Scaling on AWS #117

Merged

suhlrich mentioned this issue Apr 19, 2024

Auto Scaling Central Issue stanfordnmbl/opencap-api#174

Closed

7 tasks

sashasimkin mentioned this issue Apr 24, 2024

automatically spin up an instance based on opencap-core docker image stanfordnmbl/opencap-infrastructure#14

Closed

sashasimkin mentioned this issue May 3, 2024

Scale up logic stanfordnmbl/opencap-api#173

Closed
