Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add plugin for monitoring GPU usage per user #947

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

htakano
Copy link

@htakano htakano commented Sep 25, 2018

It's similar to nvidia_gpu_, but shows each users' GPU usage graph.

@htakano htakano changed the title Add plugin for monitoring GPU usage(memory) per user Add plugin for monitoring GPU usage per user Sep 25, 2018
Copy link
Collaborator

@sumpfralle sumpfralle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for such a nice awk plugin :)

Your plugin differs from the nvidia_gpu_ plugin at least by structure (using multigraph - yeah!) and probably also by content. Could you please describe the differences in the documentation header (e.g. USAGE)?

For bonus karma: maybe you want to upload example graphs below example-graphs/PLUGIN_NAME-day.png?

fi
fi

gpuUSERS=$(clean_fieldname "$(ls /home)" | tr "\n" " ")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels slightly dirty to me, since it uses the non-documented behaviour of clean_fieldname with regard to multi-line input. Please use something like the following instead:

find /home -maxdepth 1 -mindepth 1 -type d | while read -r dname; do clean_fieldname "$(basename "$dname")"; done

Btw.: is the underlying assumption (relevant users having a home directory below /home/ that matches their name) a good choice?
How about the following - in order to get at least around the directory/username match requirement?

getent passwd | cut -d : -f 1,6 | grep ":/home/" | cut -d : -f 1

(I cannot judge its portability - maybe you can comment)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some environments "getent passwd" will not return users in LDAP (or even YP/NIS):

$ getent passwd | grep niclangf
$ getent passwd niclangf
niclangf:*:7181:10001:niclangf:/home/niclangf:/bin/bash

and in some environments "getent passwd" will return many many thousands of users.

What you find in /home is also very dodgy. Home directories can be mounted many other places in large environments.

So getting a "full catalog" of users is not very reliable or scalable. But if you have either a username or a UID this works

$ getent passwd niclangf
niclangf:*:7181:10001:niclangf:/home/niclangf:/bin/bash
$ getent passwd 7181
niclangf:*:7181:10001:niclangf:/home/niclangf:/bin/bash

If you want to provide a stable list of users to keep the graphs stable then persist the usernames... but I see I never got around to writing persistence functions for plugin.sh... There are good hints in Munin/Plugin.pm about what environment variables and such to use if you feel wizardly and want to write a file to persist something at least.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what could be a good and portable approach for requesting "all non-system users" (I think, this is the goal)?
I have the feeling, that we cannot provide a really good and platform neutral implementation with the little effort that is justified for this task in this context. Thus I would be happy with anything that does not feel too dirty (maybe even the original proposal).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't. Just graph for those that have been using the GPU and somehow persist it.

Copy link
Author

@htakano htakano Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for good comments.
If there isn't suitable implementation for listing all non-system users,
then I think it is still good for users to be configurable.
The cpubyuser plugin has similar implementation.

}
return j;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove trailing whitespace.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise excellent plugin!

END {
if (n < 0) {

print "No NVIDIA GPUs detected. Exiting."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please print this to stderr and exit with an error.


=head1 NAME

gpubyuser - Plugin to monitor GPU memory usage by user
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention, this is the full name of the plugin file.

@htakano
Copy link
Author

htakano commented Sep 26, 2018

I've modified some parts based on your feedback.

@sumpfralle
Copy link
Collaborator

It looks good to me, now!

Just one question:

Otherwise graph shows users that have been using GPUs only.

Currently gpuUSERS defaults to an empty string. This does not match the above description, or?
Since this plugin supports autoconf it should work with reasonable defaults without manual intervention. Thus I would suggest one of the following approaches:

  • try to find out, which accounts are using the GPU right now (ideally: persist this list below MUNIN_PLUGSTATE)
  • try to assemble a reasonable list of possible users (e.g. find /home -mindepth 1 -maxdepth 1 -type d -print0 | xargs -0 -r -n 1 basename)

@htakano
Copy link
Author

htakano commented Sep 27, 2018

Thank you for nice suggestions!

Otherwise graph shows users that have been using GPUs only.

Currently gpuUSERS defaults to an empty string. This does not match the above description, or?

It was not correct. How about this comment below?

If env.gpuusers is set, graph always shows listed users
(root, hideki in example) whether using GPU or not.
Otherwise, graph shows users that are using the GPU right now only.

try to find out, which accounts are using the GPU right now (ideally: persist this list below MUNIN_PLUGSTATE)
it should work with reasonable defaults without manual intervention. Thus I would suggest one of the following approaches:

For now, no suitable/portable method is found to list all non-system users automatically,
(find directories right under /home or getent would be at risk of listing thousands of accounts)
so I think it's better to be configurable.

@sumpfralle
Copy link
Collaborator

Otherwise, graph shows users that are using the GPU right now only.

Are you sure, that the plugin currently really does this?
As far as I understand the awk code, it looks like it will print no values at all, if the environment variable is empty. But maybe I misunderstand the code?

For now, no suitable/portable method is found to list all non-system users automatically,
(find directories right under /home or getent would be at risk of listing thousands of accounts)
so I think it's better to be configurable.

That sounds reasonable.
Please just keep in mind, that if the plugin supports autoconf, then it should work without a specific configuration (i.e. the graph should not be empty). I am not sure, whether this is currently its behaviour due to my question above.

@sumpfralle
Copy link
Collaborator

Maybe you want to comment on my last question?

Thank you!

@sumpfralle
Copy link
Collaborator

@htakano: ping?

@github-actions
Copy link

Stale pull request message

@sumpfralle
Copy link
Collaborator

Ping?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants