
Issues with Adding New GPU Servers to Magic Castle Cluster #331

Open

OscarDiez opened this issue Nov 2, 2024 · 8 comments

Labels: azure, bug (Something isn't working), question (Further information is requested)


OscarDiez commented Nov 2, 2024

I’ve been working on adding 3 new GPU servers to the Magic Castle cluster, but unfortunately, I’ve been facing multiple issues with the setup, and I’m at a bit of a standstill.

Issues Encountered:

  1. Puppet Configuration and GPU Drivers
    I’ve been trying to get the NVIDIA drivers and kernel modules properly installed, but Puppet keeps returning the following error:
  Error: Unable to find a match: kmod-nvidia-latest-dkms

As a result, several stages are being skipped due to failed dependencies, including services like nvidia-persistenced and nvidia-dcgm. Despite manually trying to install the correct drivers (such as nvidia-driver-cuda), the error persists.

I’ve checked the logs and Puppet config files but haven’t been able to pinpoint the root cause. Here’s a portion of the error from the Puppet run:

  Error: /Stage[main]/Profile::Gpu::Install::Passthrough/Package[kmod-nvidia-latest-dkms]/ensure: change from 'purged' to 'present' failed.
  2. SLURM Node Availability
    SLURM also seems to have trouble recognizing the new nodes. The nodes (gpu-node[1-3]) are showing up as down# in SLURM:
PARTITION          AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu-node              up   infinite      3  down# gpu-node[1-3]

When I try to submit jobs to these nodes, I get the following error:

Batch job submission failed: Invalid account or account/partition combination specified
Additionally, jobs remain pending with the reason:

(ReqNodeNotAvail, UnavailableNodes:gpu-node[1-3])
  3. Logs and Configuration
    I’ve checked the slurmd service on the nodes and confirmed that it’s running. I’ve also reviewed the following logs and config files:

  • /var/log/slurmctld.log on the controller shows node availability issues.
  • /var/log/slurmd.log on the GPU nodes themselves doesn't reveal much beyond the standard communication errors.
  • The slurm.conf file appears to correctly define the GPU nodes, but they are still marked as down# in SLURM.
Attempts and Outcome
I’ve tried multiple fixes over the last few days, including:

  • Manually installing drivers and reconfiguring Puppet.
  • Restarting SLURM and resuming the nodes via scontrol.
  • Ensuring Munge is running properly on all nodes.
  • Updating the SLURM node state using scontrol update nodename=gpu-node1 state=RESUME.
Despite my best efforts, the nodes remain unavailable for job scheduling and for spawning sessions via Jupyter, and I'm starting to feel a bit desperate at this point.

I would really appreciate your help with this issue or any pointers to documentation or someone who could assist. It’s been a challenging process, and any guidance you can provide would be invaluable.

cmd-ntrf self-assigned this Nov 5, 2024
cmd-ntrf added the "question" (Further information is requested) label Nov 5, 2024
cmd-ntrf (Member) commented Nov 5, 2024

Hi Oscar,

Problems 2 and 3 stem from problem 1. When the GPU drivers cannot be installed properly, slurmd won't start and the node will never be available for jobs.

  1. Which version of Magic Castle are you using?
  2. Which cloud provider are you using?
  3. What image / operating system are you using?

It should be pretty straightforward to find the culprit once you provide these three pieces of information.

Best,
Felix

OscarDiez (Author) commented Nov 5, 2024 via email

cmd-ntrf (Member) commented Nov 5, 2024

Could you try from scratch but using the latest beta release instead?
https://github.com/ComputeCanada/magic_castle/releases/tag/14.0.0-beta.6

14.0.0 is just days away from being officially released, and it will probably solve your problem.
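
In case it helps, a hypothetical sketch of pinning the Azure module to that tag in main.tf, assuming a git-sourced module block; only the source line changes, and the rest of the existing cluster definition is left as is:

  module "azure" {
    # Track the 14.0.0-beta.6 tag of the Terraform modules.
    source = "git::https://github.com/ComputeCanada/magic_castle.git//azure?ref=14.0.0-beta.6"

    # cluster_name, domain, image, instances, volumes, public_keys, etc.
    # stay exactly as they are in the current main.tf.
  }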

OscarDiez (Author) commented Nov 6, 2024 via email

cmd-ntrf (Member) commented Nov 6, 2024

Yes, unfortunately you will have to recreate users and upload data if you start from scratch.

You could start a new cluster next to the one you already have and move the data over before destroying it, but you will still have to recreate the users.

OscarDiez (Author) commented Nov 11, 2024 via email

cmd-ntrf (Member) commented

Hi Oscar,

In the future, I would appreciate it if you could use the GitHub web interface to comment on issues. Replying via email disables the markdown rendering, which makes your comments somewhat tougher to read.

Your assessment of the issue is correct. A recent change in the AzureRM Terraform provider changed the default value of azurerm_public_ip's sku argument from Basic to Standard. Explicitly setting it to Basic solves the issue, since that was the default before the provider change. Thank you for reporting this; Azure is underused in Magic Castle and this sort of issue often flies under my radar.
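
For anyone hitting the same error before the fixed release is out, a minimal sketch of the workaround in azure/network.tf, assuming the resource otherwise keeps the arguments shown in the block quoted later in this thread:

  resource "azurerm_public_ip" "public_ip" {
    for_each            = module.design.instances
    name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
    location            = var.location
    resource_group_name = local.resource_group_name

    # Pin the SKU to Basic, the AzureRM provider's previous default.
    sku = "Basic"

    # Dynamic allocation is assumed here; keep whatever allocation_method your
    # copy of network.tf already uses (the Basic SKU accepts either).
    allocation_method = "Dynamic"
  }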

To your question:

  • To fix the issue for now, the best approach is to modify network.tf as you did.
  • I will publish a new release with the fix today.
  • 14.0.0-beta.7 does not include the fix; 14.0.0-beta.8 will.
  • Unfortunately, there are no mechanisms in Magic Castle at the moment to facilitate migrating data and users between clusters, as the original intent was disposable clusters for training. I would suggest creating the new cluster before deleting the previous one, then transferring the data via rsync, re-creating the users, and finally deleting the previous cluster.

One final remark: the tag "gpu-node" you used for your GPU instances does not exist, so the GPU nodes will not be properly configured. Replace it with the "node" tag; Puppet will detect whether a compute node has a GPU and configure it accordingly.
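
As an illustration, a rough sketch of what the instances map in main.tf could look like with the corrected tag. The instance types, counts, and the non-GPU entries are placeholders, not values taken from this cluster; the only point being illustrated is the "node" tag on the GPU instances.

  instances = {
    mgmt       = { type = "Standard_DS2_v2",       count = 1, tags = ["mgmt", "puppet", "nfs"] }
    login      = { type = "Standard_DS2_v2",       count = 1, tags = ["login", "public", "proxy"] }
    # Keep whatever hostname prefix you prefer, but tag the GPU instances with
    # "node": Puppet detects the GPU on the instance and configures the drivers.
    "gpu-node" = { type = "Standard_NC4as_T4_v3",  count = 3, tags = ["node"] }
  }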


odiezg commented Nov 14, 2024

Many thanks,
I am still getting issues with the public_ip resource in network.tf:

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["node1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["mgmt1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["node2"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

│ Error: static IP allocation must be used when creating Standard SKU public IP addresses
│
│   with module.azure.azurerm_public_ip.public_ip["gpu-node1"],
│   on azure/network.tf line 18, in resource "azurerm_public_ip" "public_ip":
│   18: resource "azurerm_public_ip" "public_ip" {

I tried the following to work around it; it creates the servers, but afterwards I can only connect via SSH to the login1 server.

locals {
  # Use the Basic SKU for instances tagged "public", Standard for the rest.
  public_ip_skus = {
    for k, v in module.design.instances :
    k => contains(v.tags, "public") ? "Basic" : "Standard"
  }

  # Standard SKU public IPs require static allocation, so force "Static"
  # for those (and for the "public" instances as well).
  public_ip_allocation_methods = {
    for k, v in module.design.instances :
    k => contains(v.tags, "public") || local.public_ip_skus[k] == "Standard" ? "Static" : "Dynamic"
  }
}

resource "azurerm_public_ip" "public_ip" {
  for_each            = module.design.instances
  name                = format("%s-%s-public-ipv4", var.cluster_name, each.key)
  location            = var.location
  resource_group_name = local.resource_group_name

  sku               = local.public_ip_skus[each.key]
  allocation_method = local.public_ip_allocation_methods[each.key]
}

The cluster is not working and it is not starting any services.
