Skip to content

Commit

Permalink
Added troubleshooting tips and diagram (we'll see if we have mermaid.js)
Browse files Browse the repository at this point in the history
  • Loading branch information
tefarley committed Sep 27, 2024
1 parent d0efd17 commit 6d93bc8
Showing 1 changed file with 49 additions and 7 deletions.
56 changes: 49 additions & 7 deletions docs/admin/ondemand.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@ After that, you can `git clone` apps to your OOD app developer environment locat
batch_connect:
basic:
script_wrapper: |
source /etc/profile
module purge
%s
set_host: host=$(facter fqdn)
Expand Down Expand Up @@ -103,18 +104,59 @@ openondemand::conf:

```
support_ticket:
data:
support_ticket:
email:
from: "noreply@%{trusted.domain}"
to: [email protected]
data:
support_ticket:
email:
from: "noreply@%{trusted.domain}"
to: [email protected]
```

More about `ondemand.d`, `openondemand::confs`, and their function and format can be found
here: <https://osc.github.io/ood-documentation/latest/reference/files/ondemand-d-ymls.html>
and here: <https://forge.puppet.com/modules/osc/openondemand/>


## Admin Troublshooting

```mermaid
sequenceDiagram
user->>apache: `/var/log/apache2/error.log`
apache->>CAS: `/var/cache/apache2/mod_auth_cas/`
CAS->>apache: return
apache->>pun: `/var/log/apache2/$fqdn_error.log`
pun->>dashboard: `/var/log/ondemand-nginx/$user/error.log`
dashboard->>oodapp: `$home/ondemand/data/sys/dashboard/batch_connect/sys/$app/output/$session_id/output.log`
oodapp->>user: render
```

1. Apache
To start, all users who navigate to the ondemand website first encounter the apache server. Any errors encountered at this step will be in the log(s) at `/var/log/apache2/error.log`

2. CAS
Apache then redirects the users to CAS for authentication. You can `grep -r $user /var/cache/apache2/mod_auth_cas/` to check if users have been authed to CAS and a cookie has been set.

3. Apache part deux
CAS brings us back to apache and here apache runs all sorts of OOD Lua hooks. Any errors encountered at this step will be in the l
og(s) at `/var/log/apache2/$fqdn_error.log`

4. The PUN (Per User Nginx) session
Apache then starts an NginX server as the user and most things like the main dashboard, submitting jobs, running apps, etc happen here in the PUN. Any errors encountered at this step will be in the logs at `/var/log/ondemand-nginx/$user/error.log`. You can also see what might be happening here by running commands like `ps aux | grep $USER` to see the users PUN, or `ps aux | grep -i nginx` to see all the PUNs. From the ondemand web UI theres an option to "Restart Web Server" which essentially kills and restarts the users PUN.

5. /pun/sys/dashboard
The dashboard is mostly covered in section 4, but just wanted to denote that apache then redirects us here after the PUN has been started where users can do everything else. At this step OOD will warn you about things like "Home Directory Not Found" and such. If you get this far I'd recommend you troubleshoot issues with users' home dir, NASii, and free space: `df | grep $HOME`, `du -sh $HOME`, `journalctl -u autofs`, and umount stuff. Check that `$HOME/ondemand` exists perhaps.

6. OOD Apps
When users start an app like JuyterLab or a VNC desktop the job is submitted by the users' PUN and here OOD copies and renders (with ERB) the global app template from `/var/www/ood/apps/sys/<app_name>/template/*` to `$HOME/ondemand/data/sys/dashboard/batch_connect/sys/<app_name>/(output)/<session_id>`. Any errors encountered at this step will be in `$HOME/ondemand/data/sys/dashboard/batch_connect/sys/<app_name>/(output)/<session_id>/*.log`.


## OOD FQDNs

### Farm
#### Production: `ondemand.vm.farm.hpc.ucdavis.edu`
#### Dev: `ondemand-dev.vm.farm.hpc.ucdavis.edu`
#### Production: `ondemand.farm.hpc.ucdavis.edu`
#### Dev: `dood.vm.farm.hpc.ucdavis.edu`

### Franklin
#### Production: `ondemand.franklin.hpc.ucdavis.edu`

### Hive
#### Production: `ondemand.hive.hpc.ucdavis.edu`

0 comments on commit 6d93bc8

Please sign in to comment.