Questions regarding ticket renewal and alternative Slurm AuthType #66

Open
3XX0 opened this issue Jun 14, 2022 · 5 comments

Comments

@3XX0

3XX0 commented Jun 14, 2022

Hi,

We're in the process of evaluating AUKS for Kerberized deployments and I had a few questions:

  • From my understanding, auksdrenewer is responsible for renewing the tickets in auksd, while the SPANK plugin process renews those on each compute node (with auks -R loop). What happens to long-running jobs when the ticket expires and is no longer renewable? Is there a way to update the ticket ahead of time if the user has obtained a fresh one? If so, does the user have to do this manually with the auks API, or can it be automated somehow (something similar to GSS rekeying with PAM, maybe)? Does it matter whether the job is running or not, and is there any race we need to watch for?

  • Has there been any effort towards replacing Munge with a Kerberos-based approach as the Slurm AuthType? It doesn't look like this project addresses this, but I guess most of its infrastructure could be reused for it.

@hautreux
Owner

Hi,
Here are some answers concerning your first point:

  • Your understanding is correct. auksdrenewer is in charge of periodically renewing all the still-renewable TGTs pushed to and stored in auksd. This ensures that jobs can start with a valid TGT even when they stay pending longer than the ticket's initial lifetime.
  • To deal with long-running jobs, that is, jobs that could run longer than the initial TGT renewable time, the auks -R loop helper tasks (started by the SPANK plugin on the compute nodes) renew their TGTs first from auksd through the auks API, with a fallback renewal mechanism using the KDC. This allows running jobs to pick up new TGTs that their users have added to auksd. As long as users push TGTs to auksd (whether by submitting new jobs or by calling auks -a) before the end of the renewable time of the TGT stored in auksd, everything will be fine. Otherwise, jobs will experience I/O errors due to the lack of a valid Kerberos credential.
  • There is no equivalent of automatic GSS rekeying for pushing tickets to auksd. A pull request is still pending to add a PAM module for auks (Added pam plugin module to add krb5 credentials to AUKS repository #22). It could certainly be used for that. I never integrated it, but it should be feasible; let me know if you give it a try. If there are multiple requests for it, I could include it.
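
The renewal order described in these points can be sketched as a small Python model. This is not the AUKS implementation: `Tgt`, `renew_via_kdc`, `renew_once` and `job_survives` are hypothetical names, times are plain integers on a shared clock, and the auksd and KDC interactions are reduced to simple checks. It only illustrates why a job outlives the renewable window when, and only when, a fresh TGT is pushed in time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Tgt:
    """Toy TGT (hypothetical model): times are integers on a shared clock."""
    expires_at: int   # end of the current ticket lifetime
    renew_until: int  # end of the renewable lifetime


def renew_via_kdc(tgt: Tgt, now: int, lifetime: int) -> Optional[Tgt]:
    """A KDC renewal only works while the ticket is valid and still renewable."""
    if now >= tgt.expires_at or now >= tgt.renew_until:
        return None
    return Tgt(min(now + lifetime, tgt.renew_until), tgt.renew_until)


def renew_once(local: Tgt, stored: Optional[Tgt], now: int, lifetime: int) -> Optional[Tgt]:
    """One pass of the helper loop: prefer the TGT held by auksd, else ask the KDC."""
    if stored is not None and stored.expires_at > now:
        return stored
    return renew_via_kdc(local, now, lifetime)


def job_survives(job_end: int, lifetime: int, renewable: int, period: int,
                 push_times: Iterable[int] = ()) -> bool:
    """Return True if the job holds a valid TGT until job_end in this toy model."""
    pushes = set(push_times)
    stored = Tgt(lifetime, renewable)  # TGT pushed at submission time, t = 0
    local = stored
    for now in range(0, job_end, period):
        if now in pushes:
            # The user pushed a fresh TGT (new job submission or auks -a).
            stored = Tgt(now + lifetime, now + renewable)
        else:
            # Stand-in for auksdrenewer keeping the stored TGT alive.
            stored = renew_via_kdc(stored, now, lifetime) or stored
        renewed = renew_once(local, stored, now, lifetime)
        if renewed is None:
            return False  # no valid credential left: the job would see I/O errors
        local = renewed
    return local.expires_at >= job_end
```

With times in hours, `job_survives(240, 10, 168, 1)` is False in this model because the renewable window closes at hour 168, while adding `push_times=(100,)` makes it True.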

Concerning replacing Munge with a Kerberos-based approach as the Slurm AuthType, I would say that it is more a Slurm feature than an auks one. This should be discussed with the Slurm developers. But for sure, this is something of interest. About 10 years ago I worked with an intern on a prototype of kerberized RPCs for Slurm, but it was unfortunately not as simple as creating a new AuthType plugin :(. The auth API of Slurm had to be modified and we never went further than a first roughly working proof of concept (using the GSSAPI, not the Auks internals). I am not even sure that I still have the code/patch, but if you are interested in working on that, I could do some digging and try to find that again.

Matthieu

@3XX0
Author

3XX0 commented Jun 16, 2022

Thank you for the detailed explanation, this is pretty much what I expected.
The PAM plugin PR is exactly what I was looking for (somehow I missed it), I will play with it and report back on how it goes.

Regarding Munge, I agree this is more of a Slurm issue and it's good to know that you've looked into it before. I might look into it once we've got everything set up. Don't worry too much about digging up the code, as it may take a while :)

3XX0 closed this as completed Jun 16, 2022
@3XX0
Author

3XX0 commented Jul 19, 2022

Reopening since I have an additional question:

I've had time to experiment a little and I was wondering if there is any reason why the SPANK logic is done in init, user_init and task_exit rather than job_prolog and job_epilog?

There are cases where multiple job steps can be running simultaneously (e.g. salloc with use_interactive_step). In those cases, multiple unique credentials are created, with an auks loop for each of them.

So why is AUKS operating at the job step level rather than the job level?
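
The counting argument behind this question can be restated as a short Python sketch. It only models the observation above (one renewer loop per step versus one per job on a node); the job and step IDs are made up, `renewer_loops` is a hypothetical helper, and nothing here reflects how AUKS or SPANK is actually implemented.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical (job_id, step_id) pairs alive at the same time on one node.
running_steps = [(1001, 0), (1001, 1), (1001, 2), (1002, 0)]


def renewer_loops(steps: Iterable[Tuple[int, int]], scope: str) -> Counter:
    """Count renewer loops per job for hooks that run per step or per job."""
    steps = list(steps)
    if scope == "step":
        # Step-level hooks (init, user_init, task_exit): one loop per step.
        return Counter(job for job, _ in steps)
    if scope == "job":
        # Job-level hooks (job_prolog, job_epilog): one loop per job.
        return Counter({job: 1 for job, _ in steps})
    raise ValueError(f"unknown scope: {scope!r}")
```

For the sample data, `renewer_loops(running_steps, "step")[1001]` is 3, while `renewer_loops(running_steps, "job")[1001]` is 1.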

3XX0 reopened this Jul 19, 2022
@3XX0
Author

3XX0 commented Apr 26, 2023

> The auth API of Slurm had to be modified and we never went further than a first roughly working proof of concept (using the GSSAPI, not the Auks internals). I am not even sure that I still have the code/patch, but if you are interested in working on that, I could do some digging and try to find that again.

@hautreux if it's not too much to ask, I would be very interested if you could find it.
I've talked to SchedMD about this and they are interested in seeing a PoC to see what can be done upstream.

@hautreux
Owner

hautreux commented May 3, 2023

I am sorry, but I am no longer in a position to access that, and the last time I checked I did not find the code.


2 participants