
Issues with Adding New GPU Servers to Magic Castle Cluster #331

Open

OscarDiez opened this issue Nov 2, 2024 · 8 comments

Labels: azure, bug (Something isn't working), question (Further information is requested)


OscarDiez commented Nov 2, 2024

I’ve been working on adding 3 new GPU servers to the Magic Castle cluster, but unfortunately, I’ve been facing multiple issues with the setup, and I’m at a bit of a standstill.

Issues Encountered:

  1. Puppet Configuration and GPU Drivers
    I’ve been trying to get the NVIDIA drivers and kernel modules properly installed, but Puppet keeps returning the following error:
  Error: Unable to find a match: kmod-nvidia-latest-dkms

As a result, several stages are being skipped due to failed dependencies, including services like nvidia-persistenced and nvidia-dcgm. Despite manually trying to install the correct drivers (such as nvidia-driver-cuda), the error persists.

I’ve checked the logs and Puppet config files but haven’t been able to pinpoint the root cause. Here’s a portion of the error from the Puppet run:

  Error: /Stage[main]/Profile::Gpu::Install::Passthrough/Package[kmod-nvidia-latest-dkms]/ensure: change from 'purged' to 'present' failed.
  2. SLURM Node Availability
    SLURM also seems to have trouble recognizing the new nodes. The nodes (gpu-node[1-3]) are showing up as down# in SLURM:
PARTITION          AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu-node              up   infinite      3  down# gpu-node[1-3]

When I try to submit jobs to these nodes, I get the following error:

Batch job submission failed: Invalid account or account/partition combination specified
Additionally, jobs remain pending with the reason:

(ReqNodeNotAvail, UnavailableNodes:gpu-node[1-3])
  3. Logs and Configuration
    I’ve checked the slurmd service on the nodes and confirmed that it’s running. I’ve also reviewed the following logs and config files:

  • /var/log/slurmctld.log on the controller shows node availability issues.
  • /var/log/slurmd.log on the GPU nodes themselves doesn't reveal much beyond the standard communication errors.
  • The slurm.conf file appears to correctly define the GPU nodes, but they are still marked as down# in SLURM.
Attempts and Outcome
I’ve tried multiple fixes over the last few days, including:

  • Manually installing drivers and reconfiguring Puppet.
  • Restarting SLURM and resuming the nodes via scontrol.
  • Ensuring Munge is running properly on all nodes.
  • Updating the SLURM node state using scontrol update nodename=gpu-node1 state=RESUME.
Despite my best efforts, the nodes remain unavailable for job scheduling and for spawning sessions via Jupyter, and I'm starting to feel a bit desperate at this point.

I would really appreciate your help with this issue or any pointers to documentation or someone who could assist. It’s been a challenging process, and any guidance you can provide would be invaluable.

cmd-ntrf self-assigned this Nov 5, 2024
cmd-ntrf added the "question" (Further information is requested) label Nov 5, 2024
cmd-ntrf (Member) commented Nov 5, 2024

Hi Oscar,

Problems 2 and 3 stem from problem 1. When the GPU drivers cannot be installed properly, slurmd won't start and the node will never be available for jobs.

  1. Which version of Magic Castle are you using?
  2. Which cloud provider are you using?
  3. What image / operating system are you using?

It should be pretty straightforward to find the culprit once you provide these three pieces of information.

Best,
Felix

OscarDiez (Author) commented Nov 5, 2024 via email

cmd-ntrf (Member) commented Nov 5, 2024

Could you try from scratch but using the latest beta release instead?
https://github.com/ComputeCanada/magic_castle/releases/tag/14.0.0-beta.6

14.0.0 is just days away from being officially released, and it will probably solve your problem.
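
In case it helps, a hypothetical sketch of pinning the Azure module to that tag in main.tf, assuming a git-sourced module block; only the source line changes, and the rest of the existing cluster definition is left as is:

  module "azure" {
    # Track the 14.0.0-beta.6 tag of the Terraform modules.
    source = "git::https://github.com/ComputeCanada/magic_castle.git//azure?ref=14.0.0-beta.6"

    # cluster_name, domain, image, instances, volumes, public_keys, etc.
    # stay exactly as they are in the current main.tf.
  }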

OscarDiez (Author) commented Nov 6, 2024 via email

cmd-ntrf (Member) commented Nov 6, 2024

Yes, unfortunately you will have to recreate users and upload data if you start from scratch.

You could start a new cluster next to the one you already have and move the data over before destroying it, but you will still have to recreate the users.

OscarDiez (Author) commented Nov 11, 2024 via email

cmd-ntrf (Member) commented

Hi Oscar,

In the future, I would appreciate it if you could use the GitHub web interface to comment on issues. Replying via email disables the markdown rendering, which makes your comments somewhat tougher to read.

Your assessment of the issue is correct. A recent change in the AzureRM Terraform provider changed the default value of azurerm_public_ip's sku argument from Basic to Standard. Explicitly setting it to Basic solves the issue, since that was the default before the provider change. Thank you for reporting this; Azure is underused in Magic Castle and this sort of issue often flies under my radar.
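
For anyone hitting the same error before the fixed release is out, a minimal sketch of the workaround in azure/network.tf, assuming the resource otherwise keeps the arguments shown in the block quoted later in this thread:

  resource "azurerm_public_ip" "public_ip" {
    for_each            = module.design.instances
    name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
    location            = var.location
    resource_group_name = local.resource_group_name

    # Pin the SKU to Basic, the AzureRM provider's previous default.
    sku = "Basic"

    # Dynamic allocation is assumed here; keep whatever allocation_method your
    # copy of network.tf already uses (the Basic SKU accepts either).
    allocation_method = "Dynamic"
  }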

To your question:

  • To fix the issue for now, the best approach is to modify network.tf as you did.
  • I will publish a new release with the fix today.
  • 14.0.0-beta.7 does not include the fix; 14.0.0-beta.8 will.
  • Unfortunately, there are no mechanisms in Magic Castle at the moment to facilitate migrating data and users between clusters, as the original intent was disposable clusters for training. I would suggest creating the new cluster before deleting the previous one, then transferring the data via rsync, re-creating the users, and finally deleting the previous cluster.

One final remark: the tag "gpu-node" you used for your GPU instances does not exist, so the GPU nodes will not be properly configured. Replace it with the "node" tag; Puppet will detect whether a compute node has a GPU and configure it accordingly.
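
As an illustration, a rough sketch of what the instances map in main.tf could look like with the corrected tag. The instance types, counts, and the non-GPU entries are placeholders, not values taken from this cluster; the only point being illustrated is the "node" tag on the GPU instances.

  instances = {
    mgmt       = { type = "Standard_DS2_v2",       count = 1, tags = ["mgmt", "puppet", "nfs"] }
    login      = { type = "Standard_DS2_v2",       count = 1, tags = ["login", "public", "proxy"] }
    # Keep whatever hostname prefix you prefer, but tag the GPU instances with
    # "node": Puppet detects the GPU on the instance and configures the drivers.
    "gpu-node" = { type = "Standard_NC4as_T4_v3",  count = 3, tags = ["node"] }
  }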


odiezg commented Nov 14, 2024

Many thanks,
I am still getting issues with the public_ip resource in network.tf:

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["node1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["mgmt1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["node2"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["gpu-node1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

I tried the following to work around it; it creates the servers, but afterwards I can only connect via SSH to the login1 server.

locals {
  # Use the Basic SKU for instances tagged "public", Standard for the rest.
  public_ip_skus = {
    for k, v in module.design.instances :
    k => contains(v.tags, "public") ? "Basic" : "Standard"
  }

  # Standard SKU public IPs require static allocation, so force "Static"
  # for those (and for the "public" instances as well).
  public_ip_allocation_methods = {
    for k, v in module.design.instances :
    k => contains(v.tags, "public") || local.public_ip_skus[k] == "Standard" ? "Static" : "Dynamic"
  }
}

resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name

  sku               = local.public_ip_skus[each.key]
  allocation_method = local.public_ip_allocation_methods[each.key]
}

The cluster is not working and it is not starting any services.
