Issues with Adding New GPU Servers to Magic Castle Cluster #331
Comments
Hi Oscar,

Problems 2 and 3 stem from problem 1. When the GPU drivers cannot be installed properly, slurmd won't start and the node will never become available for jobs.

1. Which version of Magic Castle are you using?
2. Which cloud provider are you using?
3. What image / operating system are you using?

It should be pretty straightforward to find the culprit of your problem once you provide these three pieces of information.

Best,
Felix |
Many thanks for the reply, it is much appreciated.

I attach the main.tf file, but the responses to your questions are as follows:

1. Magic Castle is 13.5.0
2. Cloud is Azure
3. OS is AlmaLinux 9-gen2
module "azure" {
source = "./azure"
config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git
"
config_version = "13.5.0"
image = {
publisher = "almalinux",
offer = "almalinux-x86_64",
sku = "9-gen2",
version = "9.4.2024050902"
}
instances = {
mgmt = { type = "Standard_B2ms", count = 1, tags = ["mgmt", "puppet",
"nfs"] },
login = { type = "Standard_B2s", count = 1, tags = ["login", "public",
"proxy"] },
node = { type = "Standard_B2s", count = 5, tags = ["node"] },
gpu-node = { type = "Standard_NV6ads_A10_v5", count = 3, tags =
["node", "gpu-node"] }
Regards,
Oscar Diez
|
Could you try from scratch but using the latest beta release instead? https://github.com/ComputeCanada/magic_castle/releases/tag/14.0.0-beta.6

14.0.0 is just days from being officially released, and it will probably solve your problem. |
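For reference, a minimal sketch of what switching releases looks like in main.tf, assuming the 14.0.0-beta release tarball has been downloaded and unpacked so that `./azure` comes from the new release:

```hcl
module "azure" {
  source         = "./azure" # from the unpacked 14.0.0-beta.6 release
  config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
  config_version = "14.0.0-beta.6"

  # ... the rest of the existing configuration is unchanged
}
```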
Many thanks for your reply.

I will test it later today, but if I rebuild the system from scratch, will the users I created and their data be kept, or will everything be deleted so that I need to recreate the users again?
|
Yes, unfortunately you will have to recreate users and upload data if you start from scratch. You could start a new cluster next to the one you already have and move the data over before destroying the old one, but you will still have to recreate the users. |
Dear Félix,
Thank you for your assistance with my previous inquiry.
I have destroyed the previous installation and deployed the new one, but it is not working. The new configuration is below; I am using the latest version of Magic Castle and the newest AlmaLinux image:
```hcl
module "azure" {
  source         = "./azure"
  config_git_url = "https://github.com/ComputeCanada/puppet-magic_castle.git"
  config_version = "14.0.0-beta.7"
  cluster_name   = "hpcie"
  domain         = "labs.faculty.ie.edu"

  # Using the Azure CLI, you can list the image versions that are available to use. For example,
  #   az vm image list --location eastus --publisher almalinux --offer almalinux-x86_64 --sku 9-gen2 --all --output table
  #   az vm image list --location eastus --publisher almalinux --offer almalinux-arm --sku 9-arm-gen2 --all --output table
  # (Note: available versions may be location specific!)
  image = {
    publisher = "almalinux",
    offer     = "almalinux-x86_64",
    sku       = "9-gen2",
    version   = "9.4.2024050902"
  }

  instances = {
    mgmt  = { type = "Standard_DS2_v2", count = 1, tags = ["mgmt", "puppet", "nfs"] },
    login = { type = "Standard_DS1_v2", count = 1, tags = ["login", "public", "proxy"] },
    node  = { type = "Standard_DS1_v2", count = 4, tags = ["node"] },
    gpu   = { type = "Standard_NV6ads_A10_v5", count = 2, tags = ["gpu-node"] }
  }
```
*Issues Encountered:*
When I try to apply the Terraform configuration to deploy the cluster with
the new GPU nodes, I receive the following error:
```
Error: static IP allocation must be used when creating Standard SKU public IP addresses

  with module.azure.azurerm_public_ip.public_ip["gpu1"],
  on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
  18: resource "azurerm_public_ip" "public_ip" {
```
This error repeats for each public IP resource being created.
*Troubleshooting Steps and Changes Tried:*
1. *Understanding the Error:*
   - The error suggests that when creating Standard SKU public IP addresses, the `allocation_method` must be set to `"Static"`, but in the configuration some public IPs are set to `"Dynamic"`.

2. *Examining network.tf:*
   Here's the relevant portion of my network.tf:

   ```hcl
   # Create public IPs
   resource "azurerm_public_ip" "public_ip" {
     for_each            = module.design.instances
     name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
     location            = var.location
     resource_group_name = local.resource_group_name
     allocation_method   = contains(each.value.tags, "public") ? "Static" : "Dynamic"
   }
   ```
3. *Attempted Fixes:*
   - *Option 1:* Explicitly set the `sku` to `"Basic"` in the `azurerm_public_ip` resource to allow `"Dynamic"` allocation:

     ```hcl
     resource "azurerm_public_ip" "public_ip" {
       for_each            = module.design.instances
       name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
       location            = var.location
       resource_group_name = local.resource_group_name
       allocation_method   = contains(each.value.tags, "public") ? "Static" : "Dynamic"
       sku                 = "Basic"
     }
     ```

   - *Result:* The error was resolved, but I'm unsure whether using the Basic SKU is appropriate for my use case.
4. *Constraints:*
   - I prefer not to modify the module files (network.tf) directly, to keep the deployment process consistent and maintainable.
   - I attempted to make changes in main.tf to resolve the issue without modifying network.tf, but was unsuccessful (see the provider-pinning sketch after this list).
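For reference, one way to avoid editing network.tf is to constrain the AzureRM provider version from the root module. This is only a sketch, under the assumption that the behaviour change comes from the 4.x provider series and that the azure module itself does not require 4.x:

```hcl
# Root main.tf, alongside module "azure" — sketch only; the exact version boundary is an assumption.
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.0, < 4.0" # stay on the 3.x series so azurerm_public_ip keeps its old defaults
    }
  }
}
```

Changing the constraint would also require re-running `terraform init -upgrade` so Terraform switches the installed provider.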
*Questions:*
- Is there a recommended way to address this issue without modifying the
module files?
- Is there an updated version of Magic Castle that resolves this problem?
- If I upgrade to version 14.0.0-beta.7 as suggested, will it resolve
this issue, and what are the implications for existing users and data?
*Additional Information:*
- I'm concerned about redeploying the cluster from scratch due to the
potential loss of existing user data.
- If upgrading to the latest beta version is the best solution, could
you advise on the best way to migrate existing data and users?
Thank you for your assistance.
Best regards,
Oscar Diez
Regards,
Oscar Diez
|
Hi Oscar,

In the future, I would appreciate it if you could use the GitHub web interface to comment on issues. Replying via email disables the markdown rendering, which makes your comments somewhat tougher to read.

Your assessment of the issue is correct. A recent change in the AzureRM Terraform provider changed the default `sku` of `azurerm_public_ip` to `Standard`, and Standard SKU public IPs require static allocation.

To your questions:

Final remark: the tag |
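For illustration, a minimal sketch of a public IP resource that is consistent with the new provider default: every address is created as a Standard SKU with static allocation. This is only a sketch of the idea, not necessarily the change shipped in Magic Castle.

```hcl
resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name
  sku                 = "Standard" # Standard SKU public IPs only support static allocation
  allocation_method   = "Static"
}
```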
Many thanks.

```
│ with module.azure.azurerm_public_ip.public_ip["node1"],
```

I tried to use this to solve it. It creates the servers, but afterwards I can only connect via SSH to the login1 server:

```hcl
locals {
  public_ip_allocation_methods = {
    ...

resource "azurerm_public_ip" "public_ip" {
  ...
  sku = local.public_ip_skus[each.key]
```

The cluster is not working and it is not starting any services. |
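A hypothetical reconstruction of what such a per-instance override might look like; the names `public_ip_skus` and `public_ip_allocation_methods` come from the fragment above, and everything else is assumed:

```hcl
locals {
  # Assumed mapping: public-facing instances get a Standard static IP,
  # everything else keeps a Basic dynamic one.
  public_ip_skus = {
    for key, values in module.design.instances :
    key => contains(values.tags, "public") ? "Standard" : "Basic"
  }
  public_ip_allocation_methods = {
    for key, values in module.design.instances :
    key => contains(values.tags, "public") ? "Static" : "Dynamic"
  }
}

resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name
  sku                 = local.public_ip_skus[each.key]
  allocation_method   = local.public_ip_allocation_methods[each.key]
}
```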
I’ve been working on adding 3 new GPU servers to the Magic Castle cluster, but unfortunately, I’ve been facing multiple issues with the setup, and I’m at a bit of a standstill.
Issues Encountered:
I’ve been trying to get the NVIDIA drivers and kernel modules properly installed, but Puppet keeps returning the following error:
As a result, several stages are being skipped due to failed dependencies, including services like nvidia-persistenced and nvidia-dcgm. Despite manually trying to install the correct drivers (such as nvidia-driver-cuda), the error persists.
I’ve checked the logs and Puppet config files but haven’t been able to pinpoint the root cause. Here’s a portion of the error from the Puppet run:
SLURM also seems to be having issues with recognizing the new nodes. The nodes (gpu-node[1-3]) are showing up as down# in SLURM:
When I try to submit jobs to these nodes, I get the following error:
Batch job submission failed: Invalid account or account/partition combination specified
Additionally, jobs remain pending with the reason:
I’ve checked the slurmd service on the nodes and confirmed that it’s running. I’ve also reviewed the following logs and config files:
/var/log/slurmctld.log on the controller shows node availability issues.
/var/log/slurmd.log on the GPU nodes themselves doesn't reveal much beyond the standard communication errors.
The slurm.conf file appears to correctly define the GPU nodes, but they are still marked as down# in SLURM.
Attempts and Outcome
I’ve tried multiple fixes over the last few days, including:
Manually installing drivers and reconfiguring Puppet.
Restarting SLURM and resuming the nodes via scontrol.
Ensuring Munge is running properly on all nodes.
Updating the SLURM node state using scontrol update nodename=gpu-node1 state=RESUME.
Despite my best efforts, the nodes remain unavailable for job scheduling and for spawning via Jupyter, and I'm starting to feel a bit desperate at this point.
I would really appreciate your help with this issue or any pointers to documentation or someone who could assist. It’s been a challenging process, and any guidance you can provide would be invaluable.