Automatic provisioning of VM instances #19
Comments
Hey, I already started (8f9ac31) cloud provider support, which does exactly that. It works a bit differently though: you first create a new cluster of type Azure/AWS/GC/Genesis and configure which instance types should be allowed and how many. Once you start a cloud experiment, that configured cluster auto-scales up/down depending on the workload. It works without adding new deepkit.yml configuration options. I think we'll get the first version with this feature released as experimental next month.
Hi @marcj, awesome to hear! Are you using [...] In any case, I would be happy to help and test the functionality with a [...]
@mullerrwd currently we don't have our own concept for distributed learning. You can certainly configure multiple instances in your deepkit.yml using the pipeline feature and then connect them to each other, but that is something I'm still trying to improve. Not yet sure which framework, if any, will be used. I'm currently planning to support GenesisCloud/AWS/GC at the beginning. Microsoft Azure is something I've never used personally, so it will probably be integrated last.
As a first case I would be interested in a single-job experiment provisioning a single VM instance. I have to check the cluster and auto-scaling functionality of Azure myself as well, since I mostly use GC, although I haven't tested the cluster functionality of GC either. So I understand the primary support is for GenesisCloud, GC and AWS. I will dive into the process for Microsoft Azure in the meantime.
Deepkit doesn't use the auto-scaling function of any of these vendors. It implements its own algorithm and uses just the most basic API calls of those vendors: createInstance/terminateInstance/getPublicIP. So as long as Azure has equivalent API calls available, it should be easy to integrate.
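The vendor-neutral approach described above can be sketched as follows. This is only an illustration of the idea, not Deepkit's actual implementation: the `CloudProvider` interface, the `AutoScaler` class, and the one-instance-per-queued-job policy are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

class CloudProvider:
    """Hypothetical vendor-neutral interface: the scaler only needs
    these three primitives from each cloud vendor (names illustrative)."""
    def create_instance(self, instance_type: str) -> str:
        raise NotImplementedError

    def terminate_instance(self, instance_id: str) -> None:
        raise NotImplementedError

    def get_public_ip(self, instance_id: str) -> str:
        raise NotImplementedError

@dataclass
class AutoScaler:
    """Toy scaling loop: one instance per queued job, capped at max_instances."""
    provider: CloudProvider
    instance_type: str
    max_instances: int
    instances: list = field(default_factory=list)

    def reconcile(self, queued_jobs: int) -> None:
        # Scale up while queued jobs outnumber instances, up to the cap.
        while queued_jobs > len(self.instances) and len(self.instances) < self.max_instances:
            self.instances.append(self.provider.create_instance(self.instance_type))
        # Scale down when instances outnumber queued jobs.
        while len(self.instances) > queued_jobs:
            self.provider.terminate_instance(self.instances.pop())
```

Because the scaler depends only on those three calls, adding a new vendor (e.g. Azure) would mean implementing the three primitives against that vendor's API.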
Good to know, I'll keep that in mind when testing on MS Azure. Knowing how you support the primary cloud providers, I'll be able to help you with Azure support.
Leaving this here for future reference: Microsoft Azure ML cluster. @marcj I assume you are using the API as described here: https://cloud.google.com/dataproc/docs/concepts/compute/gpus

```python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that the cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)
```

and

```python
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

# Create a new runconfig object
run_amlcompute = RunConfiguration()

# Use the cpu_cluster you created above.
run_amlcompute.target = cpu_cluster

# Enable Docker
run_amlcompute.environment.docker.enabled = True

# Set Docker base image to the default CPU-based image
run_amlcompute.environment.docker.base_image = DEFAULT_CPU_IMAGE

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
run_amlcompute.environment.python.user_managed_dependencies = False

# Specify CondaDependencies obj, add necessary packages
run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])
```

After which you can kick off the training experiment using the same [...] I'm still working my way through the examples and documentation and my own tests. I will report my findings here for your convenience.
Checking the configuration (deepkit.yml) documentation, I do not see an appropriate method to provision VM instances (e.g. MS Azure) as a node through a REST API.
A current workaround is to provision the instance manually and register it as a node.
However, this does not prevent unnecessary idle time of the instance, which adds to the cost if one does not stop the instance directly after an experiment.
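Until native provisioning lands, one way to limit the idle cost described above is to tear the instance down the moment the experiment process exits, whether it succeeded or not. A minimal sketch; both commands are placeholders for whatever CLI the cloud vendor actually provides:

```python
import subprocess

def run_and_terminate(experiment_cmd: list, terminate_cmd: list) -> int:
    """Run an experiment command, then always run the (vendor-specific,
    placeholder) terminate command afterwards, even if the experiment fails.
    Returns the experiment's exit code."""
    try:
        return subprocess.run(experiment_cmd).returncode
    finally:
        # Runs on success, failure, or KeyboardInterrupt alike.
        subprocess.run(terminate_cmd)
```

The `try/finally` is the point of the sketch: the terminate step cannot be skipped by a crashing experiment, which is exactly the failure mode that racks up idle-instance costs.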
Preferred functionality would be:
- configure the instance provisioning in deepkit.yml
- start/stop the instance per experiment file through the cloud provider's API.

Example config file:
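To illustrate the preferred functionality, a hypothetical deepkit.yml section might look like the following. Every key under `provision` is invented for this sketch and is not an actual Deepkit configuration option:

```yaml
# Hypothetical sketch only: these keys are not real Deepkit options.
provision:
  provider: azure              # azure | aws | gc | genesis
  instance_type: Standard_NC6  # vendor-specific VM size
  max_instances: 1
  terminate_after_experiment: true  # stop the VM as soon as the job ends
```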
If you have a different workaround in place, I would be happy to hear about it!