Welcome to the High-Performance Computing Core Facility (HPCCF) Documentation Site. These pages are intended to be a how-to for commonly asked questions about resources supported by the UC Davis High-Performance Computing Core Facility.
HPCCF is a core facility reporting through the Office of Research and supported by individual university colleges, the Office of the Provost, and the Vice Chancellor for Research.
Before contacting HPCCF support, first try searching this documentation. This site provides information on accessing and interacting with HPCCF supported clusters, an overview of available software ecosystems, and tutorials for commonly used software and access patterns. It is split into a Users section for end-users and an Admin section with information relevant to system administrators and advanced users.
Questions about the documentation or resources supported by the HPCCF can be directed to hpc-help@ucdavis.edu.
"},{"location":"#getting-started-with-hpc","title":"Getting Started with HPC","text":"Please read the Supported Services page.
The high-performance computing model at UC Davis starts with a principal investigator (PI) purchasing resources (compute, GPU, storage) and making them available to their lab. HPCCF will assist in onboarding and providing support.
If you are a new principal investigator interested in purchasing resources, please read the Our Clusters section below to determine which clusters are appropriate for onboarding. HPCCF can assist with hardware and storage investments for condo clusters and sells fair-share priority, primary, and archive storage for fair-share clusters. Please email hpc-help@ucdavis.edu with your affiliation to start the onboarding process. Researchers external to UC Davis can also purchase resources by inquiring at the same address.
For getting started with HPC access under an existing PI, please see requesting an account.
"},{"location":"#our-clusters","title":"Our Clusters","text":""},{"location":"#condo-clusters","title":"Condo Clusters","text":"An HPC condo-style cluster is a shared computing infrastructure where different users or groups contribute resources (such as compute nodes, storage, or networking) to a common pool, similar to how individual condo owners share common amenities.
Farm: Sponsored by the College of Agriculture and Environmental Sciences. Farm resources can be purchased by principal investigators regardless of affiliation.
Franklin: Sponsored by units within the College of Biological Sciences, Franklin is open to PIs within the Center for Neuroscience, Microbiology and Molecular Genetics, Molecular and Cellular Biology, and other approved collaborators.
HPC2: Sponsored by the College of Engineering and Computer Science and open to principal investigators associated with COE.
Peloton: Peloton is open to principal investigators associated with the College of Letters and Science. Peloton has a shared tier open to users associated with CLAS.
"},{"location":"#fair-share-clusters","title":"Fair-Share Clusters","text":"A fair-share HPC algorithm is a resource allocation strategy used to ensure equitable access to computing resources among multiple users or groups. The goal is to balance the workload and prevent any single user or group from monopolizing the resources.
LSSC0 (Barbera) is an HPC shared resource coordinated and run by HPCCF. LSSC0 is scheduled with a fair-share algorithm.
"},{"location":"#how-to-ask-for-help","title":"How to ask for help","text":"Emails sent to the HPCCF are documented in Service Now via hpc-help@ucdavis.edu. Please include your name, relevant cluster name, account name if possible, and a brief description of your request or question. Please be patient, as HPCCF staff will respond to your inquiry as soon as possible. HPCCF staff are available to respond to requests on scheduled university work days and are available from 8:00 am to 5:00 pm.
"},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"This site is written in markdown using MkDocs with the Material for MkDocs theme. If you would like to contribute, you may fork our repo and submit a pull request.
"},{"location":"#additional-information","title":"Additional Information","text":"This section is for HPCCF admins to document our internal infrastructure, processes, and architectures. Although the information may be of interest to end users, it is not designed or maintained for their consumption; nothing written here should be confused as an offering of service. For example, although we describe our Virtual Machine infrastructure, which we used for hosting a variety of production-essential services for our clusters, we do not offer VM hosting for end users.
"},{"location":"admin/cobbler/","title":"Cobbler","text":"HPCCF uses cobbler for provisioning and managing internal DNS.
There is a cobbler server per cluster as well as one for the public HPC VLAN.
cobbler.hpc - public HPC VLAN
cobbler.hive - hive private and management VLANs
cobbler.farm - farm
cobbler.peloton - peloton
cobbler.franklin - franklin
hpc1, hpc2, and lssc0 do not have associated cobbler servers.
cobbler system add --name=<hostname> --profile=infrastructure --netboot=false --interface=default --mac=xx:xx:xx:xx:xx:xx --dns-name=hostname.cluster.hpc.ucdavis.edu --hostname=<hostname> --ip-address=10.11.12.13\n
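After adding a system, it can help to verify the record and run a sync so cobbler regenerates the files it manages; a minimal follow-up sketch (the hostname is a placeholder, as above):
cobbler system report --name=<hostname>\ncobbler sync\n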
"},{"location":"admin/configuration/","title":"Configuration Management","text":"ie: puppet
"},{"location":"admin/ddn/","title":"DDN","text":"The DDN provides backend storage for proxmox.
"},{"location":"admin/ddn/#access","title":"Access","text":"The primary means of administration is via the web interface. You will need to be on the HPC VLAN.
"},{"location":"admin/dns/","title":"DNS","text":"DNS is split between internal (what machines on one of the HPCCF VLANs see) vs. external (what the rest of the campus and world sees).
"},{"location":"admin/dns/#external","title":"External","text":"HPCCF uses InfoBlox for public-facing DNS.
"},{"location":"admin/dns/#internal","title":"Internal","text":"Internal DNS is managed by cobbler.
"},{"location":"admin/netbox/","title":"Netbox","text":"HPCCF's Netbox Site is our source of truth for our rack layouts, network addressing, and other infrastructure. NetBox is an infrastructure resource modeling (IRM) application designed to empower network automation. NetBox was developed specifically to address the needs of network and infrastructure engineers.
"},{"location":"admin/netbox/#netbox-features","title":"Netbox Features","text":"This section will give an overview of how HPCCF admins utilize and administer Netbox.
"},{"location":"admin/netbox/#how-to-add-assets-into-netbox","title":"How to add assets into Netbox","text":"Navigate to HPCCF's Netbox instance here: HPCCF's Netbox Site
Select the site to which you will be adding an asset. In this example I have chosen Campus DC:
Scroll down to the bottom of this page and select the location where you will add your asset; here I chose the Storage Cabinet:
On this page scroll to the bottom and select Add a Device:
After you have selected Add a Device you should see a page like this:
Fill out this page with the specifics of the asset. Some fields are not required, but try to complete as many of the available fields as possible. Here is an example of a created asset and how it should look:
Be sure to click Save to have the device added.
On the asset page, select the + Add Components dropdown and choose the component you wish to add; for this example I have chosen a Console Port:
Here again, fill out the dropdowns as thoroughly as possible; the example shown is of an interface that has already been added:
Again, make sure to click Save so the component is added.
This process can be used to add any of the following components to a device:
After a component such as an interface or power port has been created, you will want to connect it to something. The process is similar for any component within Netbox. This example shows how to connect an InfiniBand port on a device to a port on an InfiniBand switch. First, navigate to the device you wish to work with and select the appropriate tab, in this case Interfaces; you will see a page like this:
Here we will connect ib1 to an InfiniBand switch by clicking the green dropdown to the right of ib1. Since we are connecting to another interface on the InfiniBand switch, we choose Interface, as shown here:
Once selected you will come to a screen that looks like this:
Once you have filled in the required information to complete the connection (plus any additional information you can provide), make sure to create the connection at the bottom; your screen should look something like this:
This is meant to be a general configuration guide to Open OnDemand (OOD) for admins. But, I'd also like this to serve as an admin troubleshooting tutorial for OOD. So, the bulk of relevant OOD configs are located in /etc/ood/config/
but the contents within are controlled by puppet. Usually, OOD is served by, or behind, apache and those configs are located in /etc/apache2/
and the default served dir is located at /var/www/ood
but these are also heavily controlled by puppet. For the rest of this config documentation I'll be categorizing by the file names, but I'll also try to refer to the puppet-openondemand class for that file as well.
Apps in OnDemand are located in /var/www/ood/apps/sys/<app name>
. The OOD dashboard itself is considered an app and is located here. The \"dev\" made apps are cloned here by puppet (openondemand::install_apps:
) from hpccf's github (i.e. https://github.com/ucdavis/hpccf-ood-jupyter). OOD apps are, put simply, sbatch scripts that are generated from ERB templates. Inside the app's directory, what is of most interest to admins is the: form.yml
, submit.yml
, and the template/
directory. I would guess that the majority of troubleshooting is happening here. Note that any of the files within this dir can end in .erb
if you want their contents dynamically generated. To learn more about apps you can find the docs here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive.html
This file represents the form users fill out and the fields for selecting clusters, partitions, cpu, mem, etc. If you wanted to add another field you can do it here. Or, if you suspect there's a bug with the web form I recommend starting here. More about form.yml can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/form.html#
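As a hedged illustration only (the num_cores field and its limits are invented, not copied from our production apps), a stripped-down form.yml might look like:
cluster: \"farm\"\nform:\n  - bc_queue\n  - bc_num_hours\n  - num_cores\nattributes:\n  num_cores:\n    widget: number_field\n    label: \"Number of cores\"\n    value: 1\n    min: 1\n    max: 64\n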
"},{"location":"admin/ondemand/#submityml","title":"submit.yml","text":"This file contains the contents of the sbatch job as well as job submission parameters that are submitted to slurm (or whatever scheduler you are using). Also, here you can configure the shell environment in which the app is run. If you suspect a bug might be a slurm, slurm submission, or a user environment issue I'd start here. More about submit.yml can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/submit.html
"},{"location":"admin/ondemand/#template","title":"template/","text":"This directory is the template for the sbatch job that the interactive app is run in. Any code, assets, etc. necessary with the app itself should be included here. When a user launches an OOD app this directory is processed by ERB templating system then copied to ~/ondemand/data/sys/dashboard/batch_connect/sys/...
. In this directory you may see three files of interest to admins: before.sh
, script.sh
, after.sh
. As their names suggest there's a script that runs before the job, one after, and the job itself. OOD starts by running the main script influenced by submit.yml
and forks thrice to run the before.sh
, script.sh
, and after.sh
. More about template/
can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/template.html
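Conceptually, the generated job wrapper stitches these together roughly as in the sketch below (simplified; the real wrapper is produced by OOD's batch_connect machinery):
source ./before.sh   # set up ports, passwords, host info\n./script.sh &        # launch the app itself in the background\n./after.sh           # wait for the app to come up, then emit connection info\nwait\n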
This is just the html view of the app form. I doubt you need to be editing this.
"},{"location":"admin/ondemand/#manifestyml","title":"manifest.yml","text":"This is where you set the app's name that shows on the dashboard and the app's category. More about manifest.yml
can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/manifest.html
If you want to edit, add, or create OOD apps you must be enabled as a dev app developer. In puppet this is done by placing your username under openondemand::dev_app_users:
and puppet will then do the following:
mkdir -p /var/www/ood/apps/dev/<username>\nsudo ln -s ~/ondemand/dev /var/www/ood/apps/dev/<username>/gateway\n
After that, you can git clone
apps to your OOD app developer environment located in ~/ondemand/dev/
. Your dev apps will show in a separate sidebar from the production ood apps and won't be visible by anyone else unless shared."},{"location":"admin/ondemand/#clustersd","title":"clusters.d","text":"/etc/ood/config/clusters.d/
is the config dir where OOD is coupled with a cluster and global scheduler options are specified. For OOD apps to work and be submitted to a cluster this yaml needs to be present and must be named after the cluster's hostname i.e. /etc/ood/config/clusters.d/farm.yml
. This area is controlled by puppet under openondemand::clusters:
. The most relevant section of this file for people not named Teddy is batch_connect:
, and more specifically the script_wrapper:
, which is where you can put shell commands that will always run when an OOD app is run.
batch_connect:\n basic:\n script_wrapper: |\n source /etc/profile\n module purge\n %s\n set_host: host=$(facter fqdn)\n vnc:\n script_wrapper: |\n source /etc/profile\n module purge\n module load conda/websockify turbovnc\n export WEBSOCKIFY_CMD=\"websockify\"\n turbovnc-ood\n %s\n set_host: host=$(facter fqdn)\n
"},{"location":"admin/ondemand/#script-wrapper","title":"Script Wrapper","text":"Under batch_connect:
is the script wrappers listed by the parent app category. Apps like JupyterLab and RStudio are in the basic
category, and VNC has its own category. Anything set in the script_wrapper:
under the app category is always run when an app of that category is run. So if you add a module load openmpi
to the script wrapper under basic:
then that will be run, and openmpi will be loaded, whenever RStudio or JupyterLab is started. The %s
is a placeholder for all the scripts from the aforementioned template/
dir. You can use the placeholder to control whether your commands run before or after your OOD app is started.
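For example (illustrative only), loading openmpi for every basic-category app would look like:
batch_connect:\n  basic:\n    script_wrapper: |\n      source /etc/profile\n      module purge\n      module load openmpi\n      # the rendered template/ scripts are substituted below\n      %s\n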
The facter fqdn
within the set_host:
key should resolve to the fqdn of the compute/gpu node the job is running on.
More about clusters.d/
can be found here: https://osc.github.io/ood-documentation/latest/installation/add-cluster-config.html
/etc/ood/config/ood_portal.yml
is the top-most config for OOD. Here be dragons; don't edit this file unless you know what you are doing! Here you can set the server name and port number that OOD will listen on, as well as OOD-related apache configs, certs, proxies, CAS confs, the root URI, node URI, logout URI, etc.
Once a user authenticates with OOD, apache then starts the PUN as the user. /etc/ood/config/nginx_stage.yml
determines all the properties of the PUN including global settings for every user's shell env. If you suspect a bug is a user shell env problem, first check the local app env configs set in: submit.yml
in the app's directory. More about nginx_stage.yml can be found here: https://osc.github.io/ood-documentation/latest/reference/files/nginx-stage-yml.html
You can make an announcement to be displayed within a banner on OOD by creating a yml or md file in /etc/ood/config/announcements.d/
. When any user navigates to OOD's dashboard, OOD will check here for the existence of any files.
Here's an example announcement yaml:
type: warning\nmsg: |\n  On Monday, September 24 from 8:00am to 12:00pm there will be a **Maintenance downtime**, which will prevent SSH login to compute nodes and running OnDemand\n
You can also create a test-announcement.yml.erb
to take advantage of ERB ruby templating. More about OOD announcements can be found here: https://osc.github.io/ood-documentation/latest/customizations.html#announcements
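For instance, a hypothetical test-announcement.yml.erb could use plain Ruby to hide the banner after a given date (the date and wording here are made up):
<% if Time.now < Time.new(2024, 9, 24) %>\ntype: warning\nmsg: |\n  Maintenance is scheduled for Monday, September 24 from 8:00am to 12:00pm.\n<% end %>\n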
You can have the OOD dashboard display the system MOTD by setting these environment variables:
MOTD_PATH=\"/etc/motd\" # this supports both file and RSS feed URIs\nMOTD_FORMAT=\"txt\" # markdown, txt, rss, markdown_erb, txt_erb\n
In /etc/ood/config/apps/dashboard/env
/etc/ood/config/ondemand.d/
is home to nearly all other OOD configs not mentioned here (i.e. ticket submission, nav customizations, branding, etc.). The contents are controlled by puppet, under openondemand::confs:
, and the puppet formatting to properly place yamls here is as follows:
openondemand::confs:\n  <name of yaml (i.e. tickets, if you want to create a tickets.yml)>:\n    data: # denotes the content to put in the yaml\n      <yaml key>: <yaml value>\n
support_ticket:\n data:\n support_ticket:\n email:\n from: \"noreply@%{trusted.domain}\"\n to: hpc-help@ucdavis.edu\n
More about ondemand.d
, openondemand::confs
, and their function and format can be found here: https://osc.github.io/ood-documentation/latest/reference/files/ondemand-d-ymls.html and here: https://forge.puppet.com/modules/osc/openondemand/
sequenceDiagram\n user->>apache: `/var/log/apache2/error.log`\n apache->>CAS: `/var/cache/apache2/mod_auth_cas/`\n CAS->>apache: return\n apache->>pun: `/var/log/apache2/$fqdn_error.log`\n pun->>dashboard: `/var/log/ondemand-nginx/$user/error.log`\n dashboard->>oodapp: `$home/ondemand/data/sys/dashboard/batch_connect/sys/$app/output/$session_id/output.log`\n oodapp->>user: render\n
To start, all users who navigate to the ondemand website first encounter the apache server. Any errors encountered at this step will be in the log(s) at /var/log/apache2/error.log
Apache then redirects the users to CAS for authentication. You can grep -r $user /var/cache/apache2/mod_auth_cas/
to check if users have been authed to CAS and a cookie has been set.
CAS brings us back to apache and here apache runs all sorts of OOD Lua hooks. Any errors encountered at this step will be in the log(s) at /var/log/apache2/$fqdn_error.log
Apache then starts an nginx server as the user, and most things, like the main dashboard, submitting jobs, and running apps, happen here in the PUN. Any errors encountered at this step will be in the logs at /var/log/ondemand-nginx/$user/error.log
. You can also see what might be happening here by running commands like ps aux | grep $USER
to see the user's PUN, or ps aux | grep -i nginx
to see all the PUNs. From the OnDemand web UI there's an option to \"Restart Web Server\" which essentially kills and restarts the user's PUN.
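A quick triage of a single user's PUN might look like this (the username is a placeholder):
user=jdoe\nps aux | grep -i nginx | grep \"${user}\"        # is their PUN running?\ntail -n 50 \"/var/log/ondemand-nginx/${user}/error.log\"\n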
The dashboard is mostly covered in section 4, but note that apache redirects us here after the PUN has been started, and this is where users can do everything else. At this step OOD will warn you about things like \"Home Directory Not Found\" and such. If you get this far, I'd recommend troubleshooting issues with the user's home directory, the NASes, and free space: df | grep $HOME, du -sh $HOME, journalctl -u autofs, and umount-related checks. Also check that $HOME/ondemand exists.
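A hedged example of those checks, run as (or on behalf of) the affected user:
df -h \"$HOME\"                               # is the NAS mounted, and is there free space?\ndu -sh \"$HOME\"                              # how much space is in use?\nls -ld \"$HOME/ondemand\"                     # OOD expects this directory to exist\njournalctl -u autofs --since \"1 hour ago\"   # recent automount errors, if any\n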
When users start an app like JupyterLab or a VNC desktop, the job is submitted by the user's PUN, and here OOD copies and renders (with ERB) the global app template from /var/www/ood/apps/sys/<app_name>/template/*
to $HOME/ondemand/data/sys/dashboard/batch_connect/sys/<app_name>/(output)/<session_id>
. Any errors encountered at this step will be in $HOME/ondemand/data/sys/dashboard/batch_connect/sys/<app_name>/(output)/<session_id>/*.log
.
Maybe the ondemand server is just in some invalid state and needs to be reset. I'd recommend you check the puppet conf at /etc/puppetlabs/puppet/puppet.conf
, run puppet agent -t
, and maybe restart the machine. Running puppet will force restart the apache server and regenerate OOD from the ood config yamls. Then you can restart the server by either ssh-ing to the server and running reboot
, or by ssh-ing to proxmox and running qm reset <vmid>
as root. TIP: you can find the vmid by finding the server in qm list
.
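Roughly, the reset procedure described above looks like this (hostnames and the VM ID are placeholders):
ssh <ondemand-host> 'puppet agent -t'       # re-applies config and restarts apache\nssh <ondemand-host> reboot\n# or, from the proxmox side (as root):\nssh proxmox1 'qm list | grep -i ondemand'   # find the VM ID\nssh proxmox1 'qm reset <vmid>'\n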
ondemand.farm.hpc.ucdavis.edu
","text":""},{"location":"admin/ondemand/#dev-doodvmfarmhpcucdavisedu","title":"Dev: dood.vm.farm.hpc.ucdavis.edu
","text":""},{"location":"admin/ondemand/#franklin","title":"Franklin","text":""},{"location":"admin/ondemand/#production-ondemandfranklinhpcucdavisedu","title":"Production: ondemand.franklin.hpc.ucdavis.edu
","text":""},{"location":"admin/ondemand/#hive","title":"Hive","text":""},{"location":"admin/ondemand/#production-ondemandhivehpcucdavisedu","title":"Production: ondemand.hive.hpc.ucdavis.edu
","text":""},{"location":"admin/provisioning/","title":"Provisioning","text":"(cobbler, etc)
"},{"location":"admin/software/","title":"Software Deployment","text":""},{"location":"admin/software/#spack","title":"Spack","text":""},{"location":"admin/software/#conda","title":"Conda","text":""},{"location":"admin/software/#other","title":"Other","text":""},{"location":"admin/vms/","title":"Virtual Machines","text":"HPCCF uses Proxmox for virtualization. Current servers are proxmox1
, proxmox2
, and proxmox3
.
To log in, point your browser to port 8006
on any of the proxmox servers, and choose UCD-CAS
as the realm. You'll need to be on the HPC VLAN to access the interface.
Use Netbox to locate a free IP address, or allocate one in the appropriate cobbler server. See provisioning for more information on selecting an IP/hostname and setting up PXE.
"},{"location":"admin/vms/#general","title":"General","text":"Choose an unused VM ID. Storage areas are pre-created on the DDN, on directory per VM ID. If more need to be created, see the DDN documentation. Populate the \"Name\" field with your chosen VM name.
"},{"location":"admin/vms/#os","title":"OS","text":"If you're installing a machine via PXE from a cobbler server, choose \"Do not use any media.\"
To add a new ISO, copy it to /mnt/pve/DDN-ISOs/template/iso/
on one of the proxmox hosts.
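For example (the ISO filename is a placeholder):
scp ubuntu-24.04-live-server-amd64.iso root@proxmox1:/mnt/pve/DDN-ISOs/template/iso/\n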
Check the Qemu Agent
box.
Defaults are fine. Adjust disk size as needed.
"},{"location":"admin/vms/#cpu","title":"CPU","text":"Use type x86-64-v3
. Adjust cores to taste.
Recent Ubuntu installers will fail unless you allocate at least 4096 MB of memory.
"},{"location":"admin/vms/#network","title":"Network","text":"See Netbox for a list of vlans.
Make sure to select VirtIO (paravirtualized)
for the network type.
Do not forget to add to DNS.
If this is a production VM, add the \"production\" tag.
"},{"location":"farm/","title":"Farm","text":"Farm is a Linux-based supercomputing cluster for the College of Agricultural and Environmental Sciences at UC Davis. Designed for both research and teaching, it is a significant campus resource primarily for CPU and RAM-based computing, with a wide selection of centrally-managed software available for research in genetics, proteomics, and related bioinformatics pipelines, weather and environmental modeling, fluid and particle simulations, geographic information system (GIS) software, and more.
To buy in to resources on the Farm cluster, contact CEAS IT director Adam Getchell: acgetchell@ucdavis.edu
"},{"location":"farm/#farm-hardware","title":"Farm Hardware","text":"Farm is an evolving cluster that changes and grows to meet the current needs of researchers, and has undergone three phases, with Farm III as the most recent evolution.
Farm III consists of 32 parallel nodes with up to 64 CPUs and 256GB RAM each in low2/med2/high2, plus 17 \"bigmem\" nodes with up to 96 CPUs and 1TB RAM each in the bml/bmm/bmh queue. All Farm III bigmem and newer parallel nodes and storage are on EDR/100Gbit interconnects. Older parallel nodes and storage are on FDR/55Gbit.
Farm II consists of 95 parallel nodes with 24 CPUs and 64GB RAM each in low/med/high, plus 9 \"bigmem\" nodes with 64 CPUs and 512GB RAM each in the bigmeml/bigmemm/bigmemh queues, and 1 additional node with 96 CPUs and 1TB RAM. Farm II nodes are on QDR/32Gbit interconnects.
Hardware from both Farm II and Farm III are still in service; Farm I has been decommissioned as of 2014.
Farm also has multiple file servers with over 5.3PB of storage space in total.
"},{"location":"farm/#access-to-farm","title":"Access to Farm","text":"All researchers in CA&ES are entitled to free access to:
8 nodes with 24 CPUs and 64GB RAM each (up to a maximum of 192 CPUs and 512GB RAM) in Farm II's low, medium, and high-priority batch queues,
4 nodes with 352 CPUs and 768GB RAM each in Farm III's low2, med2, and high2-priority batch queues.
The bml (bigmem, low priority/requeue) partition, which has 24 nodes with a combined 60 TB of RAM.
In addition to this, each new user is allocated a 20GB home directory. If you want to use the CA&ES free tier, select \"CA&ES free tier\" from the list of sponsors here.
Additional usage and access may be purchased by contributing to Farm III through the node and/or storage rates or by purchasing equipment and contributing through the rack fee rate.
Contributors always receive priority access to the resources that they have purchased within one minute with the \"one-minute guarantee.\" Users can also request additional unused resources on a \"fair share\" basis in the medium or low partitions.
"},{"location":"farm/#farm-administration","title":"Farm Administration","text":"Farm hardware and software are administrated by the HPC Core Facility Team.
"},{"location":"farm/#current-rates","title":"Current Rates","text":"As of October 2023, the rates for Farm III:
Node and Storage Rates (each buy-in guarantees access for 5 years): -
Equipment may be purchased directly by researchers based on actual cost. Equipment quote available upon request.
Sponsor - CAES
Information about partitions - what are low, med, and high, and what GPUs are available on Farm?
Free tier access to 20GB capped storage.
Low partition - Intermittent access to idle resources above the limit
Medium Partition - Shared use of idle resources above permitted limit
High Partition - Dedicated use of invested resources.
CPU threads - 15,384
GPU count - 29
Aggregated RAM - 66 TB
Maximum RAM per node - 2TB
Node Count - 202
Inter-Connect - 200Gbps
Total number of Users - 726/328
"},{"location":"franklin/","title":"Franklin","text":"Franklin is a high performance computing (HPC) cluster for the College of Biological Sciences at UC Davis. Its primary use is for research in genetics, genomics, and proteomics, structural biology via cryogenic electron microscopy, computational neuroscience, and generally, the computational biology workflows related to those fields. Franklin currently consists of 7 AMD CPU nodes each with 128 physical and 256 logical cores and 1TB of RAM, 9 GPU nodes with a total of 72 Nvidia RTX A4000, RTX A5000, RTX 6000 Ada, and RTX 2080 TI GPUs, and a collection of ZFS file servers providing approximately 3PB of storage.
"},{"location":"franklin/scheduling/","title":"Job Scheduling","text":""},{"location":"franklin/scheduling/#partitions","title":"Partitions","text":""},{"location":"franklin/scheduling/#quality-of-service","title":"Quality of Service","text":""},{"location":"franklin/storage/","title":"Storage","text":""},{"location":"franklin/storage/#home-directories","title":"Home Directories","text":"All users are allocated 20GB of storage for their home directory. This space is free and not associated with lab storage quotas.
"},{"location":"franklin/storage/#lab-storage-allocations","title":"Lab Storage Allocations","text":"Research data should be stored on lab storage allocations. These allocations are mounted at /group/[PI_GROUP_NAME]
. Note that these directories are mounted as-needed, so your particular allocation might not show up when you run ls /group
; you will need to access the path directly. You can find your PI group name by running groups
: this will output your user name and a name ending in grp
. The latter corresponds to the directory name under /group
, unless otherwise requested by your PI.
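For example (the group name shown here is made up):
$ groups\njdoe examplegrp\n$ cd /group/examplegrp   # autofs mounts the allocation on first access\n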
Franklin has a deployment of DeepMind's AlphaFold, as well as its databases, compiled from source1. The databases are located at /share/databases/alphafold
; this directory is exported as $ALPHAFOLD_DB_ROOT
when the module is loaded.
This is not a docker deployment. As such, many of the examples provided online need to be slightly modified. The main script supplied by the alphafold package is run_alphafold.py
, which is the script that the docker container calls internally. All the same command line arguments that can be passed to alphafold's run_docker.py
script can be passed to run_alphafold.py
, but the latter requires all the database locations be supplied:
run_alphafold.py \\\n--data_dir=\"$ALPHAFOLD_DB_ROOT\" \\\n--uniref90_database_path=$ALPHAFOLD_DB_ROOT/uniref90/uniref90.fasta \\\n--mgnify_database_path=$ALPHAFOLD_DB_ROOT/mgnify/mgy_clusters_2022_05.fa \\\n--template_mmcif_dir=$ALPHAFOLD_DB_ROOT/pdb_mmcif/mmcif_files \\\n--obsolete_pdbs_path=$ALPHAFOLD_DB_ROOT/pdb_mmcif/obsolete.dat \\\n--bfd_database_path=$ALPHAFOLD_DB_ROOT/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n--uniref30_database_path=$ALPHAFOLD_DB_ROOT/uniref30/UniRef30_2021_03 \\\n--pdb70_database_path=$ALPHAFOLD_DB_ROOT/pdb70/pdb70 \\\n[OTHER ARGS]\n
Because this is annoying, we have supplied a wrapper script named alphafold-wrapped
with our module that passes these common options for you. Any of the arguments not passed above will be passed along to the run_alphafold.py
script; for example:
alphafold-wrapped \\\n--output_dir=[OUTPUTS] \\\n--fasta_paths=[FASTA INPUTS] \\\n--max_template_date=2021-11-01 \\\n--use_gpu_relax=true\n
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Springer Science and Business Media LLC. https://doi.org/10.1038/s41586-021-03819-2 ↩
Franklin has multiple CPU and GPU optimized versions of the Relion cryo-EM structural determination package. The head node has been configured to support X11 forwarding, enabling the Relion GUI to be launched. Relion jobs are submitted for batch processing on the cluster node via Slurm. Each Relion module exports the necessary configurations to pre-fill job submission and dependency information in the GUI, and we have defined additional GUI fields to further configure Slurm parameters. We are also maintaining an additional software package, relion-helper
, to assist users in switching between Relion modules within the same project.
Your first step is deciding which Relion variant you should use. We recommend version 4.0.0, as it is the currently-supported stable release. There are three variants of this version: relion/cpu/4.0.0+amd
, relion/gpu/4.0.0+amd
, and relion/gpu/4.0.0+intel
, which correspond to the CPU optimized, GPU with AMD CPU optimized, and GPU with Intel CPU optimized builds, respectively. More information about these modules is available in the Module Variants section. In general, unless you have access to the three GPU nodes owned by the Al-Bassam lab, you can ignore the Intel variants, and use the CPU +amd
version for multi-node CPU only jobs and the GPU +amd
version if you have access to a GPU node.
If you are completely unfamiliar with Relion, you should start with the tutorial.
Note
Because Relion is GUI driven, you need to ssh
to Franklin with X11 forwarding enabled. Instructions for enabling X11 forwarding can be found in the Access section.
Make sure you have loaded one of the Relion modules:
$ module list relion\n\nCurrently Loaded Modules Matching: relion\n 1) relion/gpu/4.0.0+amd 2) relion-helper/0.2\n
Change your working directory your Relion project directory and type relion
. The Relion GUI should then pop up locally. There will be a bit of latency when using it, especially if you are off campus. You may be able to reduce latency by enabling SSH compression.
The relion start screen."},{"location":"franklin/software/cryoem/#dependency-configuration","title":"Dependency Configuration","text":"
The paths to software that different Relion jobs use will be automatically filled in. Editing these paths, unless you really, really know what you are doing, is not recommended and will likely result in problems, as some of these dependencies are compiled with architecture-specific flags that match their Relion variant.
Pre-filled dependent program path.
Danger
If you plan to switch between Relion modules within the same project, you must use the procedure described in the relion-helper section. Failure to do so will result in execution errors.
"},{"location":"franklin/software/cryoem/#slurm-configuration","title":"Slurm Configuration","text":"Our Relion deployment has additional fields in the Running tabs. These new fields are:
--mail-user
sbatch
/srun
parameter.--memory-per-cpu
parameter. Total RAM use of a job will be (Number of MPI procs) * (Number of Threads) * (Memory per CPU), when the Number of Threads field is available; otherwise it will be (Number of MPI procs) * (Memory per CPU).--time
parameter.TYPE:NUM
is supplied (example: a4000:4
), specific models of GPU will be requested. See the Resources section for more information on available GPU types. The relion/cpu
modules lack the GPU resources field. Note the submission script as well.
The relion/gpu
module has an extra field for GPU resources. Also note the differing submission script.
The default GUI fields serve their original purposes:
--ntasks
parameter. These tasks may be distributed across multiple nodes, depending on the number of Threads requested. For GPU runs, this should be the number of GPUs + 1.--cpus-per-task
parameter, which means it is the number of threads per MPI proc. Some job types do not expose this field, as they can only be run with a single-thread per MPI proc.--partition
parameter. More information on partitions can be found in the Queueing section./share/apps/spack/templates/hpccf/franklin
or in our spack GitHub repo.Sometimes, you may wish to use different Relion modules for different tasks while working within the same project -- perhaps you'd prefer to use the CPU-optimized version for CTF estimation and the GPU-optimized version for 3D refinement. This does not work out of the box. Relion fills the filesystem paths of its dependencies and templates from environment variables, and those environment variables are set in the modulefiles of the differing Relion builds. However, when a Relion job is run, those paths are cached in hidden .star
files in the project directory, and the next time Relion is run, it fills those paths from the cache files instead of the environment variables. This means that, after switching modules, the cached location of the previous module will be used, instead of the exported environment variables from the new module. This causes major breakage due to dependencies having different compilation options to match the parent Relion they are attached to and Slurm templates having different configuration options available.
Luckily, we have a solution! We wrote and are maintaining relion-helper, a simple utility that updates the cached variables in a project to match whatever Relion module is currently loaded. Let's go over example use of the tool.
In this example, assume we have a relion project directory at /path/to/my/project
. We ran some steps with the module relion/gpu/4.0.0+amd
, and now want to switch to relion/cpu/4.0.0+amd
. First, let's swap modules:
$ module unload relion/gpu/4.0.0+amd \namdfftw/3.2+amd: unloaded.\nctffind/4.1.14+amd: unloaded.\nrelion/gpu/4.0.0+amd: unloaded.\nmotioncor2/1.5.0: unloaded.\ngctf/1.06: unloaded.\nghostscript/9.56.1: unloaded.\n\n$ module load relion/cpu/4.0.0+amd.lua \namdfftw/3.2+amd: loaded.\nctffind/4.1.14+amd: loaded.\nrelion/cpu/4.0.0+amd: loaded.\nmotioncor2/1.5.0: loaded.\ngctf/1.06: loaded.\nghostscript/9.56.1: loaded.\n
And load relion-helper:
$ module load relion-helper \nrelion-helper/0.2: loaded.\n\n$ relion-helper -h\nusage: relion-helper [-h] {reset-cache} ...\n\npositional arguments:\n {reset-cache}\n\noptions:\n -h, --help show this help message and exit\n
Now, change to the project directory:
$ cd /path/to/my/project\n
Then, run the utility. It will pull the updated values from the appropriate environment variables that were exported by the new module and write them to the cache files in-place.
$ relion-helper reset-cache\n> .gui_ctffindjob.star:41:\n qsub_extra2: 2 => 10000\n> .gui_ctffindjob.star:42:\n qsub_extra3: 10000 => 12:00:00\n> .gui_ctffindjob.star:43:\n qsubscript: /share/apps/spack/templates/hpccf/franklin/relion.4.0.0.gpu.zen2.slurm.template.sh => \n/share/apps/spack/templates/hpccf/franklin/relion.4.0.0.cpu.slurm.template.sh\n> .gui_class2djob.star:53:\n qsub_extra2: 2 => 10000\n> .gui_class2djob.star:54:\n qsub_extra3: 10000 => 12:00:00\n> .gui_class2djob.star:55:\n qsubscript: /share/apps/spack/templates/hpccf/franklin/relion.4.0.0.gpu.zen2.slurm.template.sh => \n/share/apps/spack/templates/hpccf/franklin/relion.4.0.0.cpu.slurm.template.sh\n> .gui_autopickjob.star:63:\n qsub_extra2: 2 => 10000\n> .gui_autopickjob.star:64:\n qsub_extra3: 10000 => 12:00:00\n> .gui_autopickjob.star:65:\n qsubscript: /share/apps/spack/templates/hpccf/franklin/relion.4.0.0.gpu.zen2.slurm.template.sh => \n/share/apps/spack/templates/hpccf/franklin/relion.4.0.0.cpu.slurm.template.sh\n> .gui_importjob.star:38:\n qsub_extra2: 2 => 10000\n...\n
The above output is truncated for brevity. For each cached variable it updates, it reports the name of the cache file, the line number of the change, and the variable name and value of the change. You can now launch Relion and continue with your work.
Each time you want to switch Relion modules for a project, you will need to run this after loading the new module.
For now, relion-helper only has the reset-cache
subcommand. You can skip cd
ing to the project directory by passing the project directory to it instead:
$ relion-helper reset-cache -p /path/to/my/project\n
Although the changes are made in-place, it leaves backups of the modified files, in case you are concerned about bugs. The original files are of the form .gui_[JOBNAME].star
, and the backups are suffixed with .bak
:
$ ls -al /path/to/my/project\ntotal 317\ndrwxrwxr-x 10 camw camw 31 Feb 3 10:02 .\ndrwxrwxr-x 4 camw camw 6 Jan 12 12:58 ..\ndrwxrwxr-x 5 camw camw 5 Jan 12 12:46 .Nodes\ndrwxrwxr-x 2 camw camw 2 Jan 12 12:40 .TMP_runfiles\n-rw-rw-r-- 1 camw camw 1959 Feb 3 10:02 .gui_autopickjob.star\n-rw-rw-r-- 1 camw camw 1957 Feb 3 10:01 .gui_autopickjob.star.bak\n-rw-rw-r-- 1 camw camw 1427 Feb 3 10:02 .gui_class2djob.star\n-rw-rw-r-- 1 camw camw 1425 Feb 3 10:01 .gui_class2djob.star.bak\n-rw-rw-r-- 1 camw camw 1430 Feb 3 10:02 .gui_ctffindjob.star\n-rw-rw-r-- 1 camw camw 1428 Feb 3 10:01 .gui_ctffindjob.star.bak\n...\n
Warning
We do not recommend changing between major Relion versions within the same project: ie, from 3.0.1 to 4.0.0.
"},{"location":"franklin/software/cryoem/#module-variants","title":"Module Variants","text":"There are currently six variations of Relion available on Franklin. Versions 3.1.3 and 4.0.0 are available, each with:
relion/cpu/[VERSION]+amd
relion/gpu/[VERSION]+amd
relion/gpu/[VERSION]+intel
The CPU-optimized builds were configured with -DALTCPU=True
and without CUDA support. For Relion CPU jobs, they will be much faster than the GPU variants. The AMD-optimized +amd
variants were compiled with -DAMDFFTW=ON
and linked against the amdfftw
implementation of FFTW
, in addition to having Zen 2 microarchitecture flags specified to GCC. The +intel
variants were compiled with AVX2 support and configured with the -DMKLFFT=True
flag, so they use the Intel OneAPI MKL implementation of FFTW
. All the GPU variants are targeted to a CUDA compute version of 7.5. The full Cryo-EM software stack is defined in the HPCCF spack configuration repository, and we maintain our own Relion spack package definition. More information on the configurations described here can be found in the Relion docs.
The different modules may need to be used with different Slurm resource directives, depending on their variants. The necessary directives, given a module and job partition, are as follows:
Module Name Slurm Partition Slurm Directivesrelion/cpu/[3.1.3,4.0.0]+amd
low
--constraint=amd
relion/cpu/[3.1.3,4.0.0]+amd
high
N/A relion/gpu/[3.1.3,4.0.0]+amd
low
--constraint=amd --gres=gpu:[$N_GPUs]
or --gres=gpu:[a4000,a5000]:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+amd
jalettsgrp-gpu
--gres=gpu:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+amd
mmgdept-gpu
--gres=gpu:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+intel
low
--constraint=intel --gres=gpu:[$N_GPUs]
or --gres=gpu:[rtx_2080_ti]:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+intel
jawdatgrp-gpu
--gres=gpu:[$N_GPUs]
For example, to use the CPU-optimized Relion module relion/cpu/4.0.0+amd
on the free, preemptable low
partition, you should submit jobs with --constraint=amd
so as to eliminate the Intel nodes in that partition from consideration. However, if you have access to and are using the high
partition with the same module, no additional Slurm directives are required, as the high
partition only has CPU compute nodes. Alternatively, if you were using an AMD-optimized GPU version, like relion/gpu/4.0.0+amd
, and wished to use 2 GPUs on the low
partition, you would need to provide both the --constraint=amd
and a --gres=gpu:2
directive, in order to get an AMD node on the partition along with the required GPUs. Those with access to and submitting to the mmgdept-gpu
queue would need only to specify --gres=gpu:2
, as that partition only has AMD nodes in it.
Note
If you are submitting jobs via the GUI, these Slurm directives will already be taken care of for you. If you wish to submit jobs manually, you can get the path to Slurm submission template for the currently-loaded module from the $RELION_QSUB_TEMPLATE
environment variable; copying this template is a good starting place for building your batch scripts.
Our installation of CTFFIND4 has +amd
and +intel
variants which, like Relion, are linked against amdfftw
and Intel OneAPI MKL, respectively. The Slurm --constraint
flags should be used with these as well, when appropriate, as indicated in the Relion directive table. Each Relion module has its companion CTFFIND4 module as a dependency, so the appropriate version will automatically be loaded when you load Relion, and the proper environment variables are set for the Relion GUI to point at them.
We have deployed MotionCor2 binaries which have been patched to link against the appropriate version of CUDA. These are targetted at a generic architecture, as the source code is not available. Like CTFFIND4, this module is brought in by Relion and the proper environment variables set for Relion to use it.
"},{"location":"franklin/software/cryoem/#gctf","title":"Gctf","text":"Gctf binaries have been patched and deployed in the same manner as MotionCor2.
"},{"location":"franklin/software/modules/","title":"Modules","text":"Franklin currently uses lmod
, which is cross-compatible with envmod
. See our lmod
documentation for more information on using the module system.
Many modules correspond to different versions of the same software, and some software has multiple variants of the same version. The default naming convention is NAME/VERSION
: for example, cuda/11.8.0
or mcl/14-137
. The version can be omitted when loading, in which case the highest-versioned module or the version marked as default (with a (D)
) will be used.
Some module names are structured as NAME/VARIANT/VERSION
. For these, the minimum name you can use for loading is NAME/VARIANT
: for example, you can load relion/gpu
or relion/cpu
, but just trying to module load relion
will fail.
Software is sometimes compiled with optimizations specific to certain hardware. These are named with the format NAME/VERSION+ARCH
or NAME/VARIANT/VERSION+arch
. For example, ctffind/4.1.14+amd
was compiled with AMD Zen2-specific optimizations and uses the amdfftw
implementation of the FFTW
library, and will fail on the Intel-based RTX2080 nodes purchased by the Al-Bassam lab (gpu-9-[10,18,26]
). Conversely, ctffind/4.1.14+intel
was compiled with Intel-specific compiler optimizations as well as linking against the Intel OneAPI MKL implementation of FFTW
, and is only meant to be used on those nodes. In all cases, the +amd
variant of a module, if it exists, is the default, as the majority of the nodes use AMD CPUs.
Software without a +ARCH
was compiled for a generic architecture and will function on all nodes. The generic architecture on Franklin is x86-64-v3
, which means they support AVX
, AVX2
, and all other previous SSE
and other vectorized instructions.
The various conda modules have their own naming scheme. These are of the form conda/ENVIRONMENT/VERSION
. The conda/base/VERSION
module(s) load the base conda environment and set the appropriate variables to use the conda activate
and deactivate
commands, while the the modules for the other environments first load conda/base
and then activate the environment to which they correspond. The the conda
section for more information on conda
and Python on Franklin.
These modules are built and managed by our Spack deployment. Most were compiled for generic architecture, meaning they can run on any node, but some are Intel or AMD specific, and some require GPU support.
"},{"location":"franklin/software/modules/#r","title":"R","text":"R is 'GNU S', a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information.
Versions: 4.1.1
Arches: generic
Modules: R/4.1.1
ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size.
Versions: 2.3.1
Arches: generic
Modules: abyss/2.3.1
FFTW (AMD Optimized version) is a comprehensive collection of fast C routines for computing the Discrete Fourier Transform (DFT) and various special cases thereof. It is an open-source implementation of the Fast Fourier transform algorithm. It can compute transforms of real and complex-values arrays of arbitrary size and dimension. AMD Optimized FFTW is the optimized FFTW implementation targeted for AMD CPUs. For single precision build, please use precision value as float. Example : spack install amdfftw precision=float
Versions: 3.2
Arches: amd
Modules: amdfftw/3.2+amd
Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other
Versions: 1.10.7
Arches: generic
Modules: ant/1.10.7
ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences.
Versions: 1.2.38
Arches: generic
Modules: aragorn/1.2.38
Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome.
Versions: 2.30.0
Arches: generic
Modules: bedtools2/2.30.0
Basic Local Alignment Search Tool.
Versions: 2.12.0
Arches: generic
Modules: blast-plus/2.12.0
Blast2GO is a bioinformatics platform for high-quality functional annotation and analysis of genomic datasets.
Versions: 5.2.5
Arches: generic
Modules: blast2go/5.2.5
BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm.
Versions: 35
Arches: generic
Modules: blat/35
Bowtie is an ultrafast, memory-efficient short read aligner for short DNA sequences (reads) from next-gen sequencers.
Versions: 1.3.0
Arches: generic
Modules: bowtie/1.3.0
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences
Versions: 2.4.2
Arches: generic
Modules: bowtie2/2.4.2
Burrow-Wheeler Aligner for pairwise alignment between DNA sequences.
Versions: 0.7.17
Arches: generic
Modules: bwa/0.7.17
bwtool is a command-line utility for bigWig files.
Versions: 1.0
Arches: generic
Modules: bwtool/1.0
A single molecule sequence assembler for genomes large and small.
Versions: 2.2
Arches: generic
Modules: canu/2.2
CAP3 is DNA Sequence Assembly Program
Versions: 2015-02-11
Arches: generic
Modules: cap3/2015-02-11
Clustal Omega: the last alignment program you'll ever need.
Versions: 1.2.4
Arches: generic
Modules: clustal-omega/1.2.4
Multiple alignment of nucleic acid and protein sequences.
Versions: 2.1
Arches: generic
Modules: clustalw/2.1
Corset is a command-line software program to go from a de novo transcriptome assembly to gene-level counts.
Versions: 1.09
Arches: generic
Modules: corset/1.09
Fast and accurate defocus estimation from electron micrographs.
Versions: 4.1.14
Arches: amd, intel
Modules: ctffind/4.1.14+intel
, ctffind/4.1.14+amd
CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). Note: This package does not currently install the drivers necessary to run CUDA. These will need to be installed manually. See: https://docs.nvidia.com/cuda/ for details.
Versions: 11.8.0, 11.7.1
Arches: generic
Modules: cuda/11.8.0
, cuda/11.7.1
Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
Versions: 2.2.1
Arches: generic
Modules: cufflinks/2.2.1
Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc. Primarily written to support an Illumina based pipeline - but should work with any FASTQs.
Versions: 2021-10-20
Arches: generic
Modules: ea-utils/2021-10-20
EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community
Versions: 6.6.0
Arches: generic
Modules: emboss/6.6.0
Pairwise sequence alignment of DNA and proteins
Versions: 2.4.0
Arches: generic
Modules: exonerate/2.4.0
This is an exonerate fork with added gff3 support. Original website with user guides: http://www.ebi.ac.uk/~guy/exonerate/
Versions: 2.3.0
Arches: generic
Modules: exonerate-gff3/2.3.0
A quality control tool for high throughput sequence data.
Versions: 0.11.9
Arches: generic
Modules: fastqc/0.11.9
FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST). We believe that FFTW, which is free software, should become the FFT library of choice for most applications.
Versions: 3.3.10
Arches: generic
Modules: fftw/3.3.10
Bayesian haplotype-based genetic polymorphism discovery and genotyping.
Versions: 1.3.6
Arches: generic
Modules: freebayes/1.3.6
Genome Analysis Toolkit Variant Discovery in High-Throughput Sequencing Data
Versions: 3.8.1, 4.2.6.1
Arches: generic
Modules: gatk/3.8.1
, gatk/4.2.6.1
The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Ada, and Go, as well as libraries for these languages.
Versions: 5.5.0, 4.9.4, 7.5.0
Arches: generic
Modules: gcc/5.5.0
, gcc/7.5.0
, gcc/4.9.4
a GPU accelerated program for Real-Time CTF determination, refinement, evaluation and correction.
Versions: 1.06
Arches: generic
Modules: gctf/1.06
Genrich is a peak-caller for genomic enrichment assays.
Versions: 0.6
Arches: generic
Modules: genrich/0.6
An interpreter for the PostScript language and for PDF.
Versions: 9.56.1
Arches: generic
Modules: ghostscript/9.56.1
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses.
Versions: 3.02b
Arches: generic
Modules: glimmer/3.02b
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
Versions: 1.12.2
Arches: generic
Modules: hdf5/1.12.2
HISAT2 is a fast and sensitive alignment program for mapping next- generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against the general human population (as well as against a single reference genome).
Versions: 2.2.0
Arches: generic
Modules: hisat2/2.2.0
HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
Versions: 3.3.2
Arches: generic
Modules: hmmer/3.3.2
Software for motif discovery and next generation sequencing analysis
Versions: 4.9.1
Arches: generic
Modules: homer/4.9.1
The Hardware Locality (hwloc) software project. The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs. It primarily aims at helping applications with gathering information about modern computing hardware so as to exploit it accordingly and efficiently.
Versions: 2.8.0
Arches: generic
Modules: hwloc/2.8.0
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
Versions: 2.12.3
Arches: generic
Modules: igv/2.12.3
Infernal (INFERence of RNA ALignment) is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs).
Versions: 1.1.4
Arches: generic
Modules: infernal/1.1.4
Intel oneAPI Compilers. Includes: icc, icpc, ifort, icx, icpx, ifx, and dpcpp. LICENSE INFORMATION: By downloading and using this software, you agree to the terms and conditions of the software license agreements at https://intel.ly/393CijO.
Versions: 2022.2.1
Arches: generic
Modules: intel-oneapi-compilers/2022.2.1
Intel oneAPI Math Kernel Library (Intel oneMKL; formerly Intel Math Kernel Library or Intel MKL), is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math. LICENSE INFORMATION: By downloading and using this software, you agree to the terms and conditions of the software license agreements at https://intel.ly/393CijO.
Versions: 2022.2.1
Arches: generic
Modules: intel-oneapi-mkl/2022.2.1
InterProScan is the software package that allows sequences (protein and nucleic) to be scanned against InterPro's signatures. Signatures are predictive models, provided by several different databases, that make up the InterPro consortium.
Versions: 5.56-89.0
Arches: generic
Modules: interproscan/5.56-89.0
IQ-TREE Efficient software for phylogenomic inference
Versions: 2.1.3
Arches: generic
Modules: iq-tree/2.1.3
Efficient and versatile phylogenomic software by maximum likelihood
Versions: 2.1.2
Arches: generic
Modules: iqtree2/2.1.2
The Java Development Kit (JDK) released by Oracle Corporation in the form of a binary product aimed at Java developers. Includes a complete JRE plus tools for developing, debugging, and monitoring Java applications.
Versions: 17.0.1
Arches: generic
Modules: jdk/17.0.1
JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA.
Versions: 1.1.11
Arches: generic
Modules: jellyfish/1.1.11
A fast multiple sequence alignment program for biological sequences.
Versions: 3.3.1
Arches: generic
Modules: kalign/3.3.1
kallisto is a program for quantifying abundances of transcripts from RNA-Seq data.
Versions: 0.48.0
Arches: generic
Modules: kallisto/0.48.0
KmerGenie estimates the best k-mer length for genome de novo assembly.
Versions: 1.7044
Arches: generic
Modules: kmergenie/1.7044
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
Versions: 1.0
Arches: generic
Modules: kraken/1.0
Kraken2 is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
Versions: 2.1.2
Arches: generic
Modules: kraken2/2.1.2
Metagenomics classifier with unique k-mer counting for more specific results.
Versions: 0.7.3
Arches: generic
Modules: krakenuniq/0.7.3
LAST finds similar regions between sequences, and aligns them. It is designed for comparing large datasets to each other (e.g. vertebrate genomes and/or large numbers of DNA reads).
Versions: 1282
Arches: generic
Modules: last/1282
The libevent API provides a mechanism to execute a callback function when a specific event occurs on a file descriptor or after a timeout has been reached. Furthermore, libevent also supports callbacks due to signals or regular timeouts.
Versions: 2.1.12
Arches: generic
Modules: libevent/2.1.12
Fast genome and metagenome distance estimation using MinHash.
Versions: 2.3
Arches: generic
Modules: mash/2.3
MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches.
Versions: 4.0.9
Arches: generic
Modules: masurca/4.0.9
The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs (also known as networks) based on simulation of (stochastic) flow in graphs.
Versions: 14-137
Arches: generic
Modules: mcl/14-137
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
Versions: 1.1.4
Arches: generic
Modules: megahit/1.1.4
The MEME Suite allows the biologist to discover novel motifs in collections of unaligned nucleotide or protein sequences, and to perform a wide variety of other motif-based analyses.
Versions: 5.3.0
Arches: generic
Modules: meme/5.3.0
MetaEuk is a modular toolkit designed for large-scale gene discovery and annotation in eukaryotic metagenomic contigs.
Versions: 6-a5d39d9
Arches: generic
Modules: metaeuk/6-a5d39d9
MinCED is a program to find Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) in full genomes or environmental datasets such as metagenomes, in which sequence size can be anywhere from 100 to 800 bp.
Versions: 0.3.2
Arches: generic
Modules: minced/0.3.2
Miniasm is a very fast OLC-based de novo assembler for noisy long reads.
Versions: 2018-3-30
Arches: generic
Modules: miniasm/2018-3-30
Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Mappy provides a convenient interface to minimap2.
Versions: 2.14
Arches: generic
Modules: minimap2/2.14
miRDeep2 is a completely overhauled tool which discovers microRNA genes by analyzing sequenced RNAs.
Versions: 0.0.8
Arches: generic
Modules: mirdeep2/0.0.8
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets
Versions: 14-7e284
Arches: generic
Modules: mmseqs2/14-7e284
This project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.
Versions: 1.48.0
Arches: generic
Modules: mothur/1.48.0
MotionCor2 is a multi-GPU program that corrects beam-induced sample motion recorded on dose fractionated movie stacks. It implements a robust iterative alignment algorithm that delivers precise measurement and correction of both global and non-uniform local motions at single pixel level, suitable for both single-particle and tomographic images. MotionCor2 is sufficiently fast to keep up with automated data collection.
Versions: 1.5.0
Arches: generic
Modules: motioncor2/1.5.0
MUMmer is a system for rapidly aligning entire genomes.
Versions: 3.23
Arches: generic
Modules: mummer/3.23
MUMmer is a versatile alignment tool for DNA and protein sequences.
Versions: 4.0.0rc1
Arches: generic
Modules: mummer4/4.0.0rc1
MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW.
Versions: 3.8.1551
Arches: generic
Modules: muscle/3.8.1551
RMBlast search engine for NCBI
Versions: 2.11.0
Arches: generic
Modules: ncbi-rmblastn/2.11.0
NCBI C++ Toolkit
Versions: 26_0_1
Arches: generic
Modules: ncbi-toolkit/26_0_1
The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives. This package contains the interface to the VDB.
Versions: 3.0.0
Arches: generic
Modules: ncbi-vdb/3.0.0
Data-driven computational pipelines.
Versions: 22.10.1
Arches: generic
Modules: nextflow/22.10.1
The free and open-source Java implementation
Versions: 11.0.17_8, 16.0.2
Arches: generic
Modules: openjdk/11.0.17_8
, openjdk/16.0.2
OpenLDAP Software is an open source implementation of the Lightweight Directory Access Protocol. The suite includes: slapd - stand-alone LDAP daemon (server) libraries implementing the LDAP protocol, and utilities, tools, and sample clients.
Versions: 2.4.49
Arches: generic
Modules: openldap/2.4.49
An open source Message Passing Interface implementation. The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.
Versions: 4.1.4
Arches: generic
Modules: openmpi/4.1.4
OrthoFinder is a fast, accurate and comprehensive analysis tool for comparative genomics. It finds orthologues and orthogroups, infers rooted gene trees for all orthogroups, and infers a rooted species tree for the species being analysed. OrthoFinder also provides comprehensive statistics for comparative genomic analyses. OrthoFinder is simple to use and all you need to run it is a set of protein sequence files (one per species) in FASTA format.
Versions: 2.5.4
Arches: generic
Modules: orthofinder/2.5.4
OrthoMCL is a genome-scale algorithm for grouping orthologous protein sequences.
Versions: 2.0.9
Arches: generic
Modules: orthomcl/2.0.9
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input.
Versions: 20220522
Arches: generic
Modules: parallel/20220522
PatchELF is a small utility to modify the dynamic linker and RPATH of ELF executables.
Versions: 0.16.1
Arches: generic
Modules: patchelf/0.16.1
PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees).
Versions: 3.697
Arches: generic
Modules: phylip/3.697
Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
Versions: 2.26.2
Arches: generic
Modules: picard/2.26.2
Pilon is an automated genome assembly improvement and variant detection tool.
Versions: 1.22
Arches: generic
Modules: pilon/1.22
PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
Versions: 1.07
Arches: generic
Modules: plink/1.07
The Process Management Interface (PMI) has been used for quite some time as a means of exchanging wireup information needed for interprocess communication. However, meeting the significant orchestration challenges presented by exascale systems requires that the process-to-system interface evolve to permit a tighter integration between the different components of the parallel application and existing and future SMS solutions. PMI Exascale (PMIx) addresses these needs by providing an extended version of the PMI definitions specifically designed to support exascale and beyond environments by: (a) adding flexibility to the functionality expressed in the existing APIs, (b) augmenting the interfaces with new APIs that provide extended capabilities, (c) forging a collaboration between subsystem providers including resource manager, fabric, file system, and programming library developers, (d) establishing a standards-like body for maintaining the definitions, and (e) providing a reference implementation of the PMIx standard that demonstrates the desired level of scalability while maintaining strict separation between it and the standard itself.
Versions: 4.1.2
Arches: generic
Modules: pmix/4.1.2
Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.
Versions: 1.14.6
Arches: generic
Modules: prokka/1.14.6
R is 'GNU S', a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information.
Versions: 4.2.0
Arches: generic
Modules: r/4.2.0
RAxML-NG is a phylogenetic tree inference tool which uses the maximum-likelihood (ML) optimality criterion. Its search heuristic is based on iteratively performing a series of Subtree Pruning and Regrafting (SPR) moves, which allows it to quickly navigate to the best-known ML tree. RAxML-NG is a successor of RAxML (Stamatakis 2014) and leverages the highly optimized likelihood computation implemented in libpll (Flouri et al. 2014).
Versions: 1.0.2
Arches: generic
Modules: raxml-ng/1.0.2
Parallel genome assemblies for parallel DNA sequencing
Versions: 2.3.1
Arches: generic
Modules: ray/2.3.1
Rclone is a command line program to sync files and directories to and from various cloud storage providers
Versions: 1.59.1
Arches: generic
Modules: rclone/1.59.1
RECON: a package for automated de novo identification of repeat families from genomic sequences.
Versions: 1.05
Arches: generic
Modules: recon/1.05
RELION (for REgularised LIkelihood OptimisatioN, pronounce rely-on) is a stand-alone computer program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM).
Versions: 3.1.3, 4.0.1, 4.0.0
Variants: gpu, cpu
Arches: intel+gpu_delay, amd, intel
Modules: relion/cpu/3.1.3+amd
, relion/gpu/3.1.3+amd
, relion/gpu/3.1.3+intel
, relion/cpu/4.0.0+amd
, relion/gpu/4.0.0+amd
, relion/gpu/4.0.0+intel
, relion/3.1.3
, relion/4.0.0
, relion/cpu/4.0.1+amd
, relion/gpu/4.0.1+amd
, relion/4.0.1
, relion/gpu/4.0.1+intel+gpu_delay
, relion/gpu/4.0.1+intel
A modified version of Relion supporting block-based-reconstruction as described in 10.1038/s41467-018-04051-9.
Versions: 3.1.2
Variants: gpu
Arches: intel
Modules: relion-bbr/gpu/3.1.2+intel
Utilities for Relion Cryo-EM data processing on clusters.
Versions: 0.1, 0.2, 0.3
Arches: generic
Modules: relion-helper/0.1
, relion-helper/0.2
, relion-helper/0.3
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
Versions: 4.0.9
Arches: generic
Modules: repeatmasker/4.0.9
RepeatModeler is a de-novo repeat family identification and modeling package.
Versions: 1.0.11
Arches: generic
Modules: repeatmodeler/1.0.11
RepeatScout - De Novo Repeat Finder, Price A.L., Jones N.C. and Pevzner P.A.
Versions: 1.0.5
Arches: generic
Modules: repeatscout/1.0.5
Quality assessment of de novo transcriptome assemblies from RNA-Seq data. rnaQUAST is a tool for evaluating RNA-Seq assemblies using a reference genome and gene database. In addition, rnaQUAST is also capable of estimating gene database coverage by raw reads and de novo quality assessment using third-party software.
Versions: 2.2.0
Arches: generic
Modules: rnaquast/2.2.0
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data.
Versions: 1.3.1
Arches: generic
Modules: rsem/1.3.1
RStudio is an integrated development environment (IDE) for R.
Versions: 2022.12.0-353
Arches: generic
Modules: rstudio-server/2022.12.0-353
Sabre is a tool that will demultiplex barcoded reads into separate files. It will work on both single-end and paired-end data in fastq format. It simply compares the provided barcodes with each read and separates the read into its appropriate barcode file, after stripping the barcode from the read (and also stripping the quality values of the barcode bases). If a read does not have a recognized barcode, then it is put into the unknown file.
Versions: 2013-09-27
Arches: generic
Modules: sabre/2013-09-27
Satsuma2 is an optimised version of Satsuma, a tool to reliably align large and complex DNA sequences providing maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs in vertebrate genomes).
Versions: 2021-03-04
Arches: generic
Modules: satsuma2/2021-03-04
Scallop is a reference-based transcriptome assembler for RNA-seq
Versions: 0.10.5
Arches: generic
Modules: scallop/0.10.5
SeqPrep is a program to merge paired end Illumina reads that are overlapping into a single longer read.
Versions: 1.3.2
Arches: generic
Modules: seqprep/1.3.2
Toolkit for processing sequences in FASTA/Q formats.
Versions: 1.3
Arches: generic
Modules: seqtk/1.3
Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3'-end of reads and also determines when the quality is sufficiently high to trim the 5'-end of reads.
Versions: 1.33
Arches: generic
Modules: sickle/1.33
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Versions: 22-05-6-1
Arches: generic
Modules: slurm/22-05-6-1
SMARTdenovo is a de novo assembler for PacBio and Oxford Nanopore (ONT) data.
Versions: master
Arches: generic
Modules: smartdenovo/master
SortMeRNA is a tool for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data
Versions: 2017-07-13
Arches: generic
Modules: sortmerna/2017-07-13
A flexible read trimming tool for Illumina NGS data.
Versions: 0.39
Arches: generic
Modules: trimmomatic/0.39
a communication library implementing high-performance messaging for MPI/PGAS frameworks
Versions: 1.13.1
Arches: generic
Modules: ucx/1.13.1
In order to access your HPC account, you may need to generate an SSH key pair for authorization. You generate a pair of keys: a public key and a private key. The private key is kept securely on your computer or device. The public key is submitted to HPCCF to grant you access to a cluster.
"},{"location":"general/access/#how-do-i-generate-an-ssh-key-pair","title":"How do I generate an SSH key pair?","text":""},{"location":"general/access/#windows-operating-system","title":"Windows Operating System","text":"We recommend MobaXterm as the most straightforward SSH client. You can download the free Home Edition (Installer Edition) from MobaXterm. Please download the Installer Edition. The Portable Edition deletes the contents of your home directory by default when it exits, which will remove your freshly generate SSH key. Once you install the stable version of MobaXterm, open its terminal and enter this command:
ssh-keygen
This command will create a private key and a public key. Do not share your private key; we recommend giving it a passphrase for security. To view the .ssh directory and to read the public key, enter these commands:
ls -al ~/.ssh\ncat ~/.ssh/*.pub\n
"},{"location":"general/access/#macos-and-linux","title":"macOS and Linux:","text":"Use a terminal to create an SSH key pair using the command:
ssh-keygen
To view the .ssh directory and to read the public key, enter these commands:
ls -al ~/.ssh\ncat ~/.ssh/*.pub\n
"},{"location":"general/access/#x11-forwarding","title":"X11 Forwarding","text":"Some software has a Graphical User Interface (GUI), and so requires X11 to be enabled. X11 forwarding allows an application on a remote server (in this case, Franklin) to render its GUI on a local system (your computer). How this is enabled depends on the operating system the computer you are using to access Franklin is running.
"},{"location":"general/access/#linux","title":"Linux","text":"If you are SSHing from a Linux distribution, you likely already have an X11 server running locally, and can support forwarding natively. If you are on campus, you can use the -Y
flag to enable it, like:
$ ssh -Y [USER]@[CLUSTER].hpc.ucdavis.edu\n
If you are off campus on a slower internet connection, you may get better performance by enabling compression with:
$ ssh -Y -C [USER]@[CLUSTER].hpc.ucdavis.edu\n
If you have multiple SSH key pairs and you want to use a specific private key to connect to the clusters, use the -i option
to specify the path to the private key: $ ssh -i ~/.ssh/id_hpc [USER]@[CLUSTER].hpc.ucdavis.edu\n
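If you use a non-default key regularly, you can avoid typing -i every time by adding an entry to $HOME/.ssh/config. A minimal sketch follows; the host alias (hpc) is arbitrary, and the hostname, username, and key path are illustrative placeholders:
Host hpc
    HostName [CLUSTER].hpc.ucdavis.edu
    User [USER]
    IdentityFile ~/.ssh/id_hpc
With such an entry in place, ssh hpc will connect using the specified key.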
"},{"location":"general/access/#macos","title":"macOS","text":"macOS does not come with an X11 implementation out of the box. You will first need to install the free, open-source XQuartz package, after which you can use the same ssh
flags as described in the Linux instructions.
If you are using our recommend windows SSH client (MobaXterm) X11 forwarding should be enabled by default. You can confirm this by checking that the X11-Forwarding
box is ticked under your Franklin session settings. For off-campus access, you may want to tick the Compression
box as well.
HPC accounts are provisioned on a per-cluster basis and granted with the permission of their principal investigator. Accounts that are provisioned under each PI will have access to that PI's purchased resources and their own separate home directory.
Access to HPC clusters is granted via the use of SSH keys. An SSH public key is required to generate an account. For information on creating SSH keys, please visit the access documentation page.
"},{"location":"general/account-requests/#hippo","title":"HiPPO","text":"The High-Performance Personnel Onboarding (HiPPO) portal can provision resources for the Farm, Franklin, and Peloton HPC clusters. Users can request an account on HiPPO by logging in with UC Davis CAS and selecting their PI.
Users who do not have a PI and are interested in sponsored tiers for Farm can request an account by selecting the IT director for CAES, Adam Getchell, as their PI.
Users who do not have a PI and who are affiliated with the College of Letters and Science can request an sponsored account on Peloton by selecting the IT director for CLAS, Jeremy Phillips as their PI.
"},{"location":"general/account-requests/#hpc1-and-hpc2","title":"HPC1 and HPC2","text":"Users who are associated with PI's in the College of Engineering can request accounts on HPC1 and HPC2 by going to the appropriate web form.
"},{"location":"general/account-requests/#lssc0-barbera","title":"LSSC0 (Barbera)","text":"Users who want access to resources on LSSC0 can request an account within the Genome Center Computing Portal and selecting 'Request an Account' with their PI.
"},{"location":"general/account-requests/#atomate","title":"Atomate","text":"Atomate accounts can be requested here.
"},{"location":"general/account-requests/#cardio-demon-impact","title":"Cardio, Demon, Impact","text":"Accounts on these systems can be requested here.
"},{"location":"general/troubleshooting/","title":"Troubleshooting","text":""},{"location":"general/troubleshooting/#common-ssh-issues","title":"Common SSH Issues","text":"Here are some of the most common issues users face when using SSH.
"},{"location":"general/troubleshooting/#keys","title":"Keys","text":"The following clusters use SSH keys: Atomate, Farm, Franklin, HPC1, HPC2, Impact, Peloton.
If you connect to one of these and are asked for a password (as distinct from a passphrase for your key), your key is not being recognized. This is usually because of permissions or an unexpected filename. SSH expects your key to be one of a specific set of names. Unless you have specified something other than the default, this is probably going to be $HOME/.ssh/id_rsa
.
If you specified a different name when generating your key, you can specify it like this:
ssh -i $HOME/.ssh/newkey [USER]@[cluster].hpc.ucdavis.edu\n
If you kept the default value, your permissions should be set so that only you can read and write the key (-rw------- or 600)
. To ensure this is the case, you can do the following:
chown 600 $HOME/.ssh/id_rsa\n
On HPC2, your public key is kept in $HOME/.ssh/authorized_keys
. Please make sure to not remove your key from this file. Doing so will cause you will lose access.
If you are trying to use a key to access LSSC0 or any of the Genome Center login nodes, SSH keys will not work, but there is another method.
To enable logins without a password, you will need to enable GSSAPI, which some systems enable by default. If not enabled, add the following to your $HOME/.ssh/config
file (create it if it doesn't exist):
GSSAPIAuthentication yes\nGSSAPIDelegateCredentials yes\n
The -K
command line switch to ssh does the same thing on a one-time basis.
Once you have GSSAPI
enabled, you can get a Kerberos ticket using
kinit [USER]@GENOMECENTER.UCDAVIS.EDU\n
SSH will use that ticket while it's valid.
"},{"location":"general/troubleshooting/#common-slurm-scheduler-issues","title":"Common Slurm Scheduler Issues","text":"These are the most common issues with job scheduling using Slurm.
"},{"location":"general/troubleshooting/#using-a-non-default-account","title":"Using a non-default account","text":"If you have access to more than one Slurm account and wish to use an account other than your default, use the -A
or --account
flag.
e.g. If your default account is in foogrp
and you wish to use bargrp
:
srun -A bargrp -t 1:00:00 --mem=20GB scriptname.sh\n
"},{"location":"general/troubleshooting/#no-default-account","title":"No default account","text":"Newer slurm accounts have no default specified, and in this case you might get error message like:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified\n
You will need to specify the account explicitly as explained above. You can find out how to view your Slurm account information in the resources section.
"},{"location":"scheduler/","title":"Job Scheduling","text":"HPC clusters run job schedulers to distribute and manage computational resources. Generally, schedulers:
HPCCF clusters use Slurm for job scheduling. A central controller runs on one of the file servers, which users submit jobs to from the access node using the srun
and sbatch
commands. The controller then determines a priority for the job based on the resources requested and schedules it on the queue. Priority calculation can be complex, but the overall goal of the scheduler is to optimize a tradeoff between throughput on the cluster as a whole and turnaround time on jobs.
The Commands section describes how to manage jobs and check cluster status using standard Slurm commands. The Resources section describes how to request computing resources for jobs. The Job Scripts section includes examples of job scripts to be used with sbatch
.
After logging in to a cluster, your session exists on the head node: a single, less powerful computer that serves as the gatekeeper to the rest of the cluster. To do actual work, you will need to write submission scripts that define your job and submit them to the cluster along with resource requests.
"},{"location":"scheduler/commands/#batch-jobs-sbatch","title":"Batch Jobs:sbatch
","text":"Most of the time, you will want to submit jobs in the form of job scripts. The batch job script specifies the resources needed for the job, such as the number of nodes, cores, memory, and walltime. A simple example would be:
jobscript.sh#!/bin/bash \n# (1)\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=1\n#SBATCH --time=01:00:00\n#SBATCH --mem=100MB\n#SBATCH --partition=low\n\necho \"Running on $(hostname)\"\n
/bin/sh
or /bin/zsh
.Which can be submitted to the scheduler by running:
$ sbatch jobscript.sh\nSubmitted batch job 629\n
The job script is a normal shell script -- note the #!/bin/bash
-- that contains additional directives. #SBATCH
lines specify directives to be sent to the scheduler; in this case, our resource requests:
--ntasks
: Number of tasks to run. Slurm may schedule tasks on the same or different nodes.--cpus-per-task
: Number of CPUs (cores) to allocate per task.--time
: Maximum wallclock time for the job.--mem
: Maximum amount of memory for the job.--partition
: The queue partition to submit to. See the queueing section for more details.Jobs that exceed their memory or time constraints will be automatically killed. There is no limit on spawning threads, but keep in mind that using far more threads than requested cores will result in rapidly decreasing performance.
#SBATCH
directives directly correspond to arguments passed to the sbatch
command. As such, one could remove the lines starting with #SBATCH
from the previous job script and submit it with:
$ sbatch --ntasks=1 --cpus-per-task=1 --time=01:00:00 --mem=100MB --partition=low jobscript.sh\n
Using directives with job scripts is recommended, as it helps you document your resource requests.
Try man sbatch
or visit the official docs for more options. More information on resource requests can be found in the Resources section, and more examples on writing job scripts can be found in the Job Scripts section.
srun
","text":"Sometimes, you want to run an interactive shell session on a node, such as running an IPython session. srun
takes the same parameters as sbatch
, while also allowing you to specify a shell. For example:
$ srun --ntasks=1 --time=01:00:00 --mem=100MB --partition=low --pty /bin/bash\nsrun: job 630 queued and waiting for resources\nsrun: job 630 has been allocated resources\ncamw@c-8-42:~$\n
Note that addition of the --pty /bin/bash
argument. You can see that the job is queued and then allocated resources, but instead of exiting, you are brought to a new prompt. In the example above, the user camw
has been moved onto the node c-8-42
, which is indicated by the new terminal prompt, camw@c-8-42
. The same resource and time constraints apply in this session as in sbatch
scripts.
This is the only way to get direct access to a node: you will not be able to simply do ssh c-8-42
, for example.
Try man srun
or visit the official docs for more options.
squeue
","text":"squeue
can be used to monitor running and queued jobs. Running it with no arguments will show all the jobs on the cluster; depending on how many users are active, this could be a lot!
$ squeue\n JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n 589 jawdatgrp Refine3D adtaheri R 1-13:51:39 1 gpu-9-18\n 631 low jobscrip camw R 0:19 1 c-8-42\n 627 low Class2D/ mashaduz R 37:11 1 gpu-9-58\n...\n
To view only your jobs, you can use squeue --me
.
$ squeue --me\n JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n 631 low jobscrip camw R 0:02 1 c-8-42\n
The format -- which columns and their width -- can be tuned with the --format
parameter. For example, you might way to also include how many cores the job requested, and widen the fields:
$ squeue --format=\"%10i %.9P %.20j %.10u %.3t %.25S %.15L %.10C %.6D %.20R\"\nJOBID PARTITION NAME USER ST START_TIME TIME_LEFT CPUS NODES NODELIST(REASON)\n589 jawdatgrp Refine3D/job015/ adtaheri R 2023-01-31T22:51:59 9:58:38 6 1 gpu-9-18\n627 low Class2D/job424/ mashaduz R 2023-02-02T12:06:27 11:13:06 60 1 gpu-9-58\n
Try man squeue
or visit the official docs for more options.
scancel
","text":"To kill a job before it has completed, use the scancel command:
$ scancel JOBID # (1)!\n
JOBID
with the ID of your job, which can be obtained with squeue
.You can cancel many jobs at a time; for example, you could cancel all of your running jobs with:
$ scancel -u $USER #(1)!\n
$USER
is an environment variable containing your username, so leave this as is to use it.Try man scancel
or visit the official docs for more options.
scontrol
","text":"scontrol show
can be used to display any information known to Slurm. For users, the most useful are the detailed job and node information.
To display details for a job, run:
$ scontrol show j 635\nJobId=635 JobName=jobscript.sh\n UserId=camw(1134153) GroupId=camw(1134153) MCS_label=N/A\n Priority=6563 Nice=0 Account=admin QOS=adminmed\n JobState=RUNNING Reason=None Dependency=(null)\n Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0\n RunTime=00:00:24 TimeLimit=01:00:00 TimeMin=N/A\n SubmitTime=2023-02-02T13:26:24 EligibleTime=2023-02-02T13:26:24\n AccrueTime=2023-02-02T13:26:24\n StartTime=2023-02-02T13:26:25 EndTime=2023-02-02T14:26:25 Deadline=N/A\n PreemptEligibleTime=2023-02-02T13:26:25 PreemptTime=None\n SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-02-02T13:26:25 Scheduler=Main\n Partition=low AllocNode:Sid=nas-8-0:449140\n ReqNodeList=(null) ExcNodeList=(null)\n NodeList=c-8-42\n BatchHost=c-8-42\n NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*\n TRES=cpu=2,mem=100M,node=1,billing=2\n Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*\n MinCPUsNode=1 MinMemoryNode=100M MinTmpDiskNode=0\n Features=(null) DelayBoot=00:00:00\n OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)\n Command=/home/camw/jobscript.sh\n WorkDir=/home/camw\n StdErr=/home/camw/slurm-635.out\n StdIn=/dev/null\n StdOut=/home/camw/slurm-635.out\n Power=\n
Where 635
should be replaced with the ID for your job. For example, you can see that this job was allocated resources on c-8-42
(NodeList=c-8-42
), that its priority score is 6563 (Priority=6563
), and that the script it ran with is located at /home/camw/jobscript.sh
.
We can also get details on nodes. Let's interrogate c-8-42
:
$ scontrol show n c-8-42\nNodeName=c-8-42 Arch=x86_64 CoresPerSocket=64 \n CPUAlloc=4 CPUEfctv=256 CPUTot=256 CPULoad=0.12\n AvailableFeatures=amd,cpu\n ActiveFeatures=amd,cpu\n Gres=(null)\n NodeAddr=c-8-42 NodeHostName=c-8-42 Version=22.05.6\n OS=Linux 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 \n RealMemory=1000000 AllocMem=200 FreeMem=98124 Sockets=2 Boards=1\n State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A\n Partitions=low,high \n BootTime=2022-12-11T02:25:44 SlurmdStartTime=2022-12-14T10:34:25\n LastBusyTime=2023-02-02T13:13:22\n CfgTRES=cpu=256,mem=1000000M,billing=256\n AllocTRES=cpu=4,mem=200M\n CapWatts=n/a\n CurrentWatts=0 AveWatts=0\n ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s\n
CPUAlloc=4
tells us that 4 cores are currently allocated on the node. AllocMem=200
indicates that 200MiB of RAM are currently allocated, with RealMemory=1000000
telling us that there is 1TiB of RAM total on the node.
sinfo
","text":"Another useful status command is sinfo
, which is specialized for displaying information on nodes and partitions. Running it without any arguments gives information on partitions:
$ sinfo\nPARTITION AVAIL TIMELIMIT NODES STATE NODELIST\nlow* up 12:00:00 3 mix gpu-9-[10,18,58]\nlow* up 12:00:00 8 idle c-8-[42,50,58,62,70,74],gpu-9-[26,66]\nhigh up 60-00:00:0 6 idle c-8-[42,50,58,62,70,74]\njawdatgrp-gpu up infinite 2 mix gpu-9-[10,18]\njawdatgrp-gpu up infinite 1 idle gpu-9-26\n
In this case, we can see that there are 3 partially-allocated nodes in the low
partition (they have state mix
), and that the time limit for jobs on the low
partition is 12 hours.
Passing the -N
flag tells sinfo
to display node-centric information:
$ sinfo -N\nNODELIST NODES PARTITION STATE \nc-8-42 1 low* idle \nc-8-42 1 high idle \nc-8-50 1 low* idle \nc-8-50 1 high idle \nc-8-58 1 low* idle \nc-8-58 1 high idle \nc-8-62 1 low* idle \nc-8-62 1 high idle \nc-8-70 1 low* idle \nc-8-70 1 high idle \nc-8-74 1 low* idle \nc-8-74 1 high idle \ngpu-9-10 1 low* mix \ngpu-9-10 1 jawdatgrp-gpu mix \ngpu-9-18 1 low* mix \ngpu-9-18 1 jawdatgrp-gpu mix \ngpu-9-26 1 low* idle \ngpu-9-26 1 jawdatgrp-gpu idle \ngpu-9-58 1 low* mix \ngpu-9-66 1 low* idle\n
There is an entry for each node in each of its partitions. c-8-42
is in both the low
and high
partitions, while gpu-9-10
is in the low
and jawdatgrp-gpu
partitions.
More verbose information can be obtained by also passing the -l
or --long
flag:
$ sinfo -N -l\nThu Feb 02 14:04:48 2023\nNODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON \nc-8-42 1 low* idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-42 1 high idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-50 1 low* idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-50 1 high idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-58 1 low* idle 256 2:64:2 100000 0 1 amd,cpu none\n...\n
This view gives the nodes' socket, core, and thread configurations, their RAM, and the feature list, which you can read about in the Resources section. Try man scontrol
or man sinfo
, or visit the official docs for scontrol
and sinfo
, for more options.
Each node -- physically distinct machines within the cluster -- will be a member of one or more partitions. A partition consists of a collection of nodes, a policy for job scheduling on that partition, a policy for conflicts when nodes are a member of more than one partition (ie. preemption), and a policy for managing and restricting resources per user or per group referred to as Quality of Service. The Slurm documentation has detailed information on how preemption and QOS definitions are handled; our per-cluster Resources sections describe how partitions are organized and preemption handled on our clusters.
"},{"location":"scheduler/resources/#accounts","title":"Accounts","text":"Users are granted access to resources via Slurm associations. An association links together a user with an account and a QOS definition. Accounts most commonly correspond to your lab, but sometimes exist for graduate groups, departments, or institutes.
To see your associations, and thus which accounts and partitions you have access to, you can use the sacctmgr
command:
$ sacctmgr show assoc user=$USER\n Cluster Account User Partition Share ... MaxTRESMins QOS Def QOS GrpTRESRunMin \n---------- ---------- ---------- ---------- --------- ... ------------- -------------------- --------- ------------- \n franklin hpccfgrp camw mmgdept-g+ 1 ... hpccfgrp-mmgdept-gp+ \n franklin hpccfgrp camw mmaldogrp+ 1 ... hpccfgrp-mmaldogrp-+ \n franklin hpccfgrp camw cashjngrp+ 1 ... hpccfgrp-cashjngrp-+ \n franklin hpccfgrp camw jalettsgr+ 1 ... hpccfgrp-jalettsgrp+ \n franklin hpccfgrp camw jawdatgrp+ 1 ... hpccfgrp-jawdatgrp-+ \n franklin hpccfgrp camw low 1 ... hpccfgrp-low-qos \n franklin hpccfgrp camw high 1 ... hpccfgrp-high-qos \n franklin jawdatgrp camw low 1 ... mcbdept-low-qos \n franklin jawdatgrp camw jawdatgrp+ 1 ... jawdatgrp-jawdatgrp+ \n franklin jalettsgrp camw jalettsgr+ 1 ... jalettsgrp-jalettsg+ \n franklin jalettsgrp camw low 1 ... mcbdept-low-qos \n
The output is very wide, so you may want to pipe it through less
to make it more readable:
sacctmgr show assoc user=$USER | less -S\n
Or, perhaps preferably, output it in a more compact format:
$ sacctmgr show assoc user=$USER format=\"account%20,partition%20,qos%40\"\n Account Partition QOS \n-------------------- -------------------- ---------------------------------------- \n hpccfgrp mmgdept-gpu hpccfgrp-mmgdept-gpu-qos \n hpccfgrp mmaldogrp-gpu hpccfgrp-mmaldogrp-gpu-qos \n hpccfgrp cashjngrp-gpu hpccfgrp-cashjngrp-gpu-qos \n hpccfgrp jalettsgrp-gpu hpccfgrp-jalettsgrp-gpu-qos \n hpccfgrp jawdatgrp-gpu hpccfgrp-jawdatgrp-gpu-qos \n hpccfgrp low hpccfgrp-low-qos \n hpccfgrp high hpccfgrp-high-qos \n jawdatgrp low mcbdept-low-qos \n jawdatgrp jawdatgrp-gpu jawdatgrp-jawdatgrp-gpu-qos \n jalettsgrp jalettsgrp-gpu jalettsgrp-jalettsgrp-gpu-qos \n jalettsgrp low mcbdept-low-qos \n
In the above example, we can see that user camw
has access to the high
partition via an association with hpccfgrp
and the jalettsgrp-gpu
partition via the jalettsgrp
account.
CPUs are the central compute power behind your jobs. Most scientific software supports multiprocessing (multiple instances of an executable with discrete memory resources, possibly but not necessarily communicating with each other), multithreading (multiple paths, or threads, of execution within a process on a node, sharing the same memory resources, but able to execute on different cores), or both. This allows computation to scale with increased numbers of CPUs, allowing bigger datasets to be analyzed.
Slurm's CPU management methods are complex and can quickly become confusing. For the purposes of this documentation, we will provide a simplified explanation; those with advanced needs should consult the Slurm documentation.
Slurm follows a distinction between its physical resources -- cluster nodes and CPUs or cores on a node -- and virtual resources, or tasks, which specify how requested physical resources will be grouped and distributed. By default, Slurm will minimize the number of nodes allocated to a job, and attempt to keep the job's CPU requests localized within a node. Tasks group together CPUs (or other resources): CPUs within a task will be kept together on the same node. Different tasks may end up on different nodes, but Slurm will exhaust the CPUs on a given node before splitting tasks between nodes unless specifically requested.
A Complication: SMT / Hyperthreading
Slurm understands the distinction between physical and logical cores. Most modern CPUs support Simultaneous Multithreading (SMT), which allows multiple independent processes to run on a single physical core. Although each of these is not a full fledged core, they have independent hardware for certain operations, and can greatly improve scalability for some tasks. However, using an individual thread within a single core makes little sense, as it shares hardware with the other SMT threads on its core; so, Slurm will always keep these threads together. In practice, this means if you ask for an odd number of CPUs, your request will be rounded up so as not to split an SMT thread between different job allocations.
The primary parameters controlling these are:
--cpus-per-task/-c
: How many CPUs to request per task. The number of CPUs requested here will always be on the same node. By default, 1.--ntasks/-n
: The number of tasks to request. By default, 1.--nodes/-N
: The minimum number of nodes to request, by default, 1.Let's explore some examples. The simple request would be to ask for 2 CPUs. We will use srun
to request resources and then immediately run the nproc
command within the allocation to report how many CPUs are available:
$ srun -c 2 nproc \nsrun: job 682 queued and waiting for resources\nsrun: job 682 has been allocated resources\n2\n
We asked for 2 CPUs per task, and Slurm gave us 2 CPUs and 1 task. What happens if we ask for 2 tasks instead of 2 CPUs?
$ srun -n 2 nproc\nsrun: job 683 queued and waiting for resources\nsrun: job 683 has been allocated resources\n1\n1\n
This time, we were given 2 separate tasks, each of which got 1 CPU. Each task ran its own instance of the nproc
command, and so each reported 1
. If we ask for more CPUs per task:
$ srun -n 2 -c 2 nproc\nsrun: job 684 queued and waiting for resources\nsrun: job 684 has been allocated resources\n2\n2\n
We still asked for 2 tasks, but this time we requested 2 CPUs in each. So, we got 2 instances of nproc
, each reported 2
CPUs in their task.
Summary
If you want to run multithreaded jobs, use --cpus-per-task N_THREADS
and -ntasks 1
. If you want a multiprocess job (or an MPI job), increase -ntasks
.
If we use -c 1
without specifying the number of tasks, we might be taken by surprise:
$ srun -c 1 nproc \nsrun: job 685 queued and waiting for resources\nsrun: job 685 has been allocated resources\n1\n1\n
We only asked for 1 CPU per task, but we got 2 tasks! This is due to SMT, described in the note above. Because Slurm will not split SMT threads, and there are 2 SMT threads per physical core, the request was rounded up to 2 CPUs total. In order to keep with the 1 CPU-per-task constraint, it spawned 2 tasks. Similarly, if we specify that we only want 1 task, CPUs per task will instead be bumped:
$ srun -c 1 -n 1 nproc\nsrun: job 686 queued and waiting for resources\nsrun: job 686 has been allocated resources\n2\n
"},{"location":"scheduler/resources/#nodes","title":"Nodes","text":"Let's explore multiple nodes a bit further. We have seen previously that the -n/ntasks
parameter will allocate discrete groups of cores. In our prior examples, however, we used small resource requests. What happens when we want to distribute jobs across nodes?
Slurm uses the block distribution method by default to distribute tasks between nodes. It will exhaust all the CPUs on a node with task groups before moving to a new node. For these examples, we're going to create a script that reports both the hostname (ie, the node) and the number of CPUs:
host-nprocs.sh#!/bin/bash\n\necho `hostname`: `nproc`\n
And make it executable with chmod +x host-nprocs.sh
.
Now let's make a multiple-task request:
$ srun -c 2 -n 2 ./host-nprocs.sh\nsrun: job 691 queued and waiting for resources\nsrun: job 691 has been allocated resources\nc-8-42: 2\nc-8-42: 2\n
As before, we asked for 2 tasks and 2 CPUs per task. Both tasks were assigned to c-8-42
, because it had enough CPUs to fulfill the request. What if it did not?
$ srun -c 120 -n 3 ./host-nprocs.sh\nsrun: job 692 queued and waiting for resources\nsrun: job 692 has been allocated resources\nc-8-42: 120\nc-8-42: 120\nc-8-50: 120\n
This time, we asked for 3 tasks each with 120 CPUs. The first two tasks were able to be fulfilled by the node c-8-42
, but that node did not have enough CPUs to allocate another 120 on top of that. So, the third task was distributed to c-8-50
. Thus, this task spanned multiple nodes.
Sometimes, we want to make sure each task has its own node. We can achieve this with the --nodes/-N
parameter. This specifies the minimum number of nodes the tasks should be allocated across. If we rerun the above example:
$ srun -c 120 -n 3 -N 3 ./host-nprocs.sh\nsrun: job 693 queued and waiting for resources\nsrun: job 693 has been allocated resources\nc-8-42: 120\nc-8-50: 120\nc-8-58: 120\n
We still asked for 3 tasks, each with 120 CPUs, but this time we specified we wanted a minimum of 3 nodes. As a result, we were allocated portions of c-8-42
, c-8-50
, and c-8-58
.
Random Access Memory (RAM) is the fast, volatile storage that your programs use to store data during execution. This can be contrasted with disk storage, which is non-volatile and many orders of magnitude slower to access, and is used for long term data -- say, your sequencing runs or cryo-EM images. RAM is a limited resource on each node, so Slurm enforces memory limits for jobs using cgroups. If a job step consumes more RAM than requested, the step will be killed.
Some (mutually exclusive) parameters for requesting RAM are:
--mem
: The memory required per-node. Usually, you want to use --mem-per-cpu
.--mem-per-cpu
: The memory required per CPU or core. If you requested \\(N\\) tasks, \\(C\\) CPUs per task, and \\(M\\) memory per CPU, your total memory usage will be \\(N * C * M\\). Note that, if \\(N \\gt 1\\), you will have \\(N\\) discrete \\(C * M\\)-sized chunks of RAM requested, possibly across different nodes.--mem-per-gpu
: Memory required per GPU, which will scale with GPUs in the same way as --mem-per-cpu
will with CPUs.For all memory requests, units can be specified explicitly with the suffixes [K|M|G|T]
for [kilobytes|megabytes|gigabytes|terabytes]
, with the default units being M
/megabytes
. So, --mem-per-cpu=500
will requested 500 megabytes of RAM per CPU, and --mem-per-cpu=32G
will request 32 gigabytes of RAM per CPU.
Here is an example of a task overrunning its memory allocation. We will use the stress-ng
program to allocate 8 gigabytes of RAM in a job that only requested 200 megabytes.
$ srun -n 1 --cpus-per-task 2 --mem-per-cpu 200M stress-ng -m 1 --vm-bytes 8G --oomable 1 \u21b5\nsrun: job 706 queued and waiting for resources\nsrun: job 706 has been allocated resources\nstress-ng: info: [3037475] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor\nstress-ng: info: [3037475] dispatching hogs: 1 vm\nstress-ng: info: [3037475] successful run completed in 2.23s\nslurmstepd: error: Detected 1 oom-kill event(s) in StepId=706.0. Some of your processes may have been killed by the cgroup out-of-memory handler.\nsrun: error: c-8-42: task 0: Out Of Memory\nsrun: launch/slurm: _step_signal: Terminating StepId=706.0\n
"},{"location":"scheduler/resources/#gpus-gres","title":"GPUs / GRES","text":""},{"location":"software/","title":"Software","text":"We provide a broad range of software and scientific libraries for our users. On our primary clusters, we deploy software with spack and conda to foster reproducibility. Packages installed via spack have automatically generated modulefiles available for each package, while our conda-deployed software is deployed in individual, centrally installed environments which can be loaded via module or activated traditionally.
To request new software, please read the information in our Software Installation Policy, and submit a request through our Software Request Form.
Because spack is our primary installation method, we are far more likely to approve a new piece of software if it is already available via a spack repo. You can check whether a spack package for your software already exists here.
"},{"location":"software/conda/","title":"Python and Conda","text":""},{"location":"software/modules/","title":"Module System","text":""},{"location":"software/modules/#intro","title":"Intro","text":"High performance compute clusters usually have a variety of software with sometimes conflicting dependencies. Software packages may need to make modifications to the user environment, or the same software may be compiled multiple times to run efficiently on differing hardware within the cluster. To support these use cases, software is managed with a module system that prepares the user environment to access specific software on load and returns the environment to its former state when unloaded. A module is the bit of code that enacts and tracks these changes to the user environment, and the module system is software that runs these modules and the collection of modules it is aware of. Most often, a module is associated with a specific software package at a specific version, but they can also be used to make more general changes to a user environment; for example, a module could load a set of configurations for the BASH shell that set color themes.
The two most commonly deployed module systems are environment modules (or envmod
) and lmod. All HPCCF clusters currently use envmod
.
The module
command is the entry point for users to manage modules in their environment. All module operations will be of the form module [SUBCOMMAND]
. Usage information is available on the cluster by running module --help
.
The basic commands are: module load [MODULENAME]
to load a module into your environment; module unload [MODULENAME]
to remove that module; module avail
to see modules available for loading; and module list
to see which modules are currently loaded. We will go over these commands, and some additional commands, in the following sections.
module avail
","text":"Lists the modules currently available to load on the system. The following is some example output from the Franklin cluster:
$ module avail\n--------------------- /share/apps/22.04/modulefiles/spack/core ----------------------\naocc/4.1.0 intel-oneapi-compilers/2023.2.1 pmix/4.2.6+amd \ncuda/8.0.61 libevent/2.1.12 pmix/4.2.6+intel \ncuda/11.2.2 libevent/2.1.12+amd pmix/default \ncuda/11.7.1 libevent/2.1.12+intel slurm/23-02-6-1 \ncuda/default libevent/default slurm/23-02-7-1 \nenvironment-modules/5.2.0 openmpi/4.1.5 ucx/1.14.1 \ngcc/7.5.0 openmpi/4.1.5+amd ucx/1.14.1+amd \ngcc/11.4.0 openmpi/4.1.5+intel ucx/1.14.1+intel \ngcc/13.2.0 openmpi/default ucx/default \nhwloc/2.9.1 pmix/4.2.6 \n\n------------------- /share/apps/22.04/modulefiles/spack/software --------------------\nabyss/2.3.5 igv/2.12.3 pplacer/1.1.alpha19 \nalphafold/2.3.2 infernal/1.1.4 prodigal/2.6.3 \namdfftw/4.1+amd intel-oneapi-mkl/2023.2.0+intel prokka/1.14.6 \namdfftw/default intel-oneapi-mkl/default py-checkm-genome/1.2.1 \nangsd/0.935 intel-oneapi-tbb/2021.10.0+amd py-cutadapt/4.4 \naragorn/1.2.41 intel-oneapi-tbb/2021.10.0+intel py-deeptools/3.5.2 \naria2/1.36.0 intel-oneapi-tbb/default py-htseq/2.0.3\n...\n
Each entry corresponds to software available for load. Modules that are currently loaded will be highlighted.
"},{"location":"software/modules/#module-list","title":"module list
","text":"Lists the modules currently loaded in the user environment. By default, the output should be similar to:
$ module list\nCurrently Loaded Modulefiles:\n 1) slurm/23-02-7-1 2) ucx/1.14.1 3) openmpi/default \n
Additional modules will be added or removed as you load and unload them.
"},{"location":"software/modules/#loading-and-unloading","title":"Loading and Unloading","text":""},{"location":"software/modules/#module-load","title":"module load
","text":"This loads the requested module into the active environment. Loading a module can edit environment variables, such as prepending directories to $PATH
so that the executables within can be run, set and unset new or existing environment variables, define shell functions, and generally, modify your user environment arbitrarily. The modifications it makes are tracked, so that when the module is eventually unloaded, any changes can be returned to their former state.
Let's load a module.
$ module load bwa/0.7.17\nbwa/0.7.17: loaded.\n
Now, you have access to the bwa
executable. If you try to run bwa mem
, you'll get its help output. This also sets the appropriate variables so that you can now run man bwa
to view its manpage.
Note that some modules have multiple versions. Running module load [MODULENAME]
without specifying a version will load the latest version, unless a default has been specified.
Some modules are nested under a deeper hierarchy. For example, relion
on Franklin has many variants, under both relion/cpu
and relion/gpu
. To load these, you must specify the second layer of the hierarchy: module load relion
will fail, but module load relion/cpu
will load the default module under relion/cpu
, which has the full name relion/cpu/4.0.0+amd
. More information on this system can be found under Organization.
The modules are all configured to set a $NAME_ROOT
variable that points to the installation prefix. This will correspond to the name of the module, minus the version. For example:
$ echo $BWA_ROOT\n/share/apps/22.04/spack/opt/software/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/bwa-0.7.17-y22jt6d7qm63i2tohmu7gqeedxytadky\n
Usually, this will be a very long pathname, as most software on the cluster is managed via the spack build system. This would be most useful if you're developing software on the cluster.
"},{"location":"software/modules/#module-unload","title":"module unload
","text":"As one might expect, module unload
removes a loaded module from your environment. Any changes made by the module are undone and your environment is restored to its state prior to loading the module.
module whatis
","text":"This command prints a description of the module, if such a description is available. For example:
$ module whatis gsl\n-------------------------------------------- /share/apps/22.04/modulefiles/spack/software --------------------------------------------\n gsl/2.7.1: The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License. The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.\n
"},{"location":"software/modules/#module-show","title":"module show
","text":"module show
will list all the changes a module would make to the environment.
$ module show gcc/11.4.0\n-------------------------------------------------------------------\n/share/apps/22.04/modulefiles/spack/core/gcc/11.4.0:\n\nmodule-whatis {The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Ada, and Go, as well as libraries for these languages.}\nconflict gcc\nprepend-path --delim : LD_LIBRARY_PATH /share/apps/22.04/spack/opt/core/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-11.4.0-evrz2iaatpna4lvzwh5sjujgfrlqprx5/lib64\n...\nsetenv CC /share/apps/22.04/spack/opt/core/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-11.4.0-evrz2iaatpna4lvzwh5sjujgfrlqprx5/bin/gcc\nsetenv CXX /share/apps/22.04/spack/opt/core/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-11.4.0-evrz2iaatpna4lvzwh5sjujgfrlqprx5/bin/g++\n...\n
This is particularly useful for developing with libraries where you might be interested in variables relevant to your build system.
"},{"location":"software/modules/#module-search","title":"module search
","text":"The module search
command allows you to search the names and whatis information for every module. The result will be a list of matching modules and the highlighted matching search terms.
Welcome to the High-Performance Computing Core Facility (HPCCF) Documentation Site. These pages are intended to be a how-to for commonly asked questions about resources supported by the UC Davis High-Performance Computing Core Facility.
HPCCF is a core faciilty reporting through the Office of Research and supported by individual university colleges, the Office of the Provost, and the Vice Chancellor for Research.
Before contacting HPCCF support, first try searching this documentation. This site provides information on accessing and interacting with HPCCF supported clusters, an overview of available software ecosystems, and tutorials for commonly used software and access patterns. It is split into a Users section for end-users and an Admin section with information relevant to system administrators and advanced users.
Questions about the documentation or resources supported by the HPCCF can be directed to hpc-help@ucdavis.edu.
"},{"location":"#getting-started-with-hpc","title":"Getting Started with HPC","text":"Please read the Supported Services page.
The high-performance computing model at UC Davis starts with a principal investigator (PI) purchasing resources (compute, GPU, storage) and making them available to their lab. HPCCF will assist in onboarding and providing support.
As a new principal investigator who is interested in purchasing resources, please read the Our Clusters section below to determine which clusters are appropriate for onboarding. HPCCF can assist with hardware and storage investments for condo clusters and sell fair-share priority, primary and archive storage for fair-share clusters. Please email hpc-help@ucdavis.edu with your affiliation to start the onboarding process. Resources external to UC Davis can also purchase resources by inquiring at the hpc-help email address.
For getting started with HPC access under an existing PI, please see requesting an account.
"},{"location":"#our-clusters","title":"Our Clusters","text":""},{"location":"#condo-clusters","title":"Condo Clusters","text":"An HPC condo-style cluster is a shared computing infrastructure where different users or groups contribute resources (such as compute nodes, storage, or networking) to a common pool, similar to how individual condo owners share common amenities.
Farm: Sponsored by the College of Agriculture and Environmental Sciences. Farm resources can be purchased by principal investigators regardless of affiliation.
Franklin: Sponsored by units within the College of Biological Sciences, Franklin is open to PIs within the Center for Neuroscience, Microbiology and Molecular Genetics, Molecular and Cellular Biology, and other approved collaborators.
HPC2: Sponsored by the College of Engineering and Computer Science, HPC2 is open to principal investigators associated with COE.
Peloton: Peloton is open to principal investigators associated with the College of Letters and Science. Peloton has a shared tier open to users associated with CLAS.
"},{"location":"#fair-share-clusters","title":"Fair-Share Clusters","text":"A fair-share HPC algorithm is a resource allocation strategy used to ensure equitable access to computing resources among multiple users or groups. The goal is to balance the workload and prevent any single user or group from monopolizing the resources.
LSSC0 (Barbera) is a shared HPC resource coordinated and run by HPCCF. LSSC0 is run with a fair-share algorithm.
"},{"location":"#how-to-ask-for-help","title":"How to ask for help","text":"Emails sent to the HPCCF are documented in Service Now via hpc-help@ucdavis.edu. Please include your name, relevant cluster name, account name if possible, and a brief description of your request or question. Please be patient, as HPCCF staff will respond to your inquiry as soon as possible. HPCCF staff are available to respond to requests on scheduled university work days and are available from 8:00 am to 5:00 pm.
"},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"This site is written in markdown using MkDocs with the Material for MkDocs theme. If you would like to contribute, you may fork our repo and submit a pull request.
"},{"location":"#additional-information","title":"Additional Information","text":"This section is for HPCCF admins to document our internal infrastructure, processes, and architectures. Although the information may be of interest to end users, it is not designed or maintained for their consumption; nothing written here should be confused as an offering of service. For example, although we describe our Virtual Machine infrastructure, which we used for hosting a variety of production-essential services for our clusters, we do not offer VM hosting for end users.
"},{"location":"admin/cobbler/","title":"Cobbler","text":"HPCCF uses cobbler for provisioning and managing internal DNS.
There is a cobbler server per cluster as well as one for the public HPC VLAN.
cobbler.hpc
 - public HPC VLAN
cobbler.hive
 - hive private and management VLANs
cobbler.farm
 - farm
cobbler.peloton
 - peloton
cobbler.franklin
 - franklin
hpc1
, hpc2
, and lssc0
 do not have associated cobbler servers.
cobbler system add --name=<hostname> --profile=infrastructure --netboot=false --interface=default --mac=xx:xx:xx:xx:xx:xx --dns-name=hostname.cluster.hpc.ucdavis.edu --hostname=<hostname> --ip-address=10.11.12.13\n
"},{"location":"admin/configuration/","title":"Configuration Management","text":"ie: puppet
"},{"location":"admin/ddn/","title":"DDN","text":"The DDN provides backend storage for proxmox.
"},{"location":"admin/ddn/#access","title":"Access","text":"The primary means of administration is via the web interface. You will need to be on the HPC VLAN.
"},{"location":"admin/dns/","title":"DNS","text":"DNS is split between internal (what machines on one of the HPCCF VLANs see) vs. external (what the rest of the campus and world sees).
"},{"location":"admin/dns/#external","title":"External","text":"HPCCF uses InfoBlox for public-facing DNS.
"},{"location":"admin/dns/#internal","title":"Internal","text":"Internal DNS is managed by cobbler.
"},{"location":"admin/netbox/","title":"Netbox","text":"HPCCF's Netbox Site is our source of truth for our rack layouts, network addressing, and other infrastructure. NetBox is an infrastructure resource modeling (IRM) application designed to empower network automation. NetBox was developed specifically to address the needs of network and infrastructure engineers.
"},{"location":"admin/netbox/#netbox-features","title":"Netbox Features","text":"This section will give an overview of how HPCCF admins utilize and administer Netbox.
"},{"location":"admin/netbox/#how-to-add-assets-into-netbox","title":"How to add assets into Netbox","text":"Navigate to HPCCF's Netbox instance here: HPCCF's Netbox Site
Select the site to which you will be adding an asset. In this example I have chosen Campus DC:
Scroll down to the bottom of the page and select the location where you will add your asset; here I chose the Storage Cabinet:
On this page scroll to the bottom and select Add a Device:
After you have selected Add a Device you should see a page like this:
Fill out this page with the specifics of the asset. Some fields are not required, but try to fill out this section as completely as possible with the fields available. Here is an example of a created asset and how it should look:
Be sure to click Save to have the device added.
On the asset page, select the + Add Components dropdown and select the component you wish to add; for this example I have chosen a Console Port:
Here again, fill out the dropdowns as thoroughly as possible; the example here is of an interface that has already been added:
Again, make sure to click Save so that the component is added.
This process can be used to add all of the following components to a device:
After a component such as an interface or power port has been created, you will want to connect it to something. The process is similar for any component within Netbox. This example shows how to connect an InfiniBand port on a device to a port on an InfiniBand switch. First, navigate to the device you wish to work with and select the appropriate tab; in this case it is Interfaces, and you will see a page like this:
Here we will connect ib1 to an InfiniBand switch by clicking the green dropdown to the right of ib1. Since we are connecting to another interface on the InfiniBand switch, we choose Interface, as shown here:
Once selected you will come to a screen that looks like this:
Once you have filled out the required information to complete the connection (and any additional information that can be provided), make sure to create the connection at the bottom. Your screen should look something like this:
This is meant to be a general configuration guide to Open OnDemand (OOD) for admins. But, I'd also like this to serve as an admin troubleshooting tutorial for OOD. So, the bulk of relevant OOD configs are located in /etc/ood/config/
but the contents within are controlled by puppet. Usually, OOD is served by, or behind, apache and those configs are located in /etc/apache2/
and the default served dir is located at /var/www/ood
but these are also heavily controlled by puppet. For the rest of this config documentation I'll be categorizing by the file names, but I'll also try to refer to the puppet-openondemand class for that file as well.
Apps in OnDemand are located in /var/www/ood/apps/sys/<app name>
. The OOD dashboard itself is considered an app and is located here. The \"dev\" made apps are cloned here by puppet (openondemand::install_apps:
) from hpccf's github (i.e. https://github.com/ucdavis/hpccf-ood-jupyter). OOD apps are, put simply, sbatch scripts that are generated from ERB templates. Inside the app's directory, what is of most interest to admins is the: form.yml
, submit.yml
, and the template/
directory. I would guess that the majority of troubleshooting is happening here. Note that any of the files within this dir can end in .erb
if you want their contents dynamically generated. To learn more about apps, you can find the docs here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive.html
This file represents the form users fill out and the fields for selecting clusters, partitions, cpu, mem, etc. If you wanted to add another field you can do it here. Or, if you suspect there's a bug with the web form I recommend starting here. More about form.yml can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/form.html#
"},{"location":"admin/ondemand/#submityml","title":"submit.yml","text":"This file contains the contents of the sbatch job as well as job submission parameters that are submitted to slurm (or whatever scheduler you are using). Also, here you can configure the shell environment in which the app is run. If you suspect a bug might be a slurm, slurm submission, or a user environment issue I'd start here. More about submit.yml can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/submit.html
"},{"location":"admin/ondemand/#template","title":"template/","text":"This directory is the template for the sbatch job that the interactive app is run in. Any code, assets, etc. necessary with the app itself should be included here. When a user launches an OOD app this directory is processed by ERB templating system then copied to ~/ondemand/data/sys/dashboard/batch_connect/sys/...
. In this directory you may see three files of interest to admins: before.sh
, script.sh
, after.sh
. As their names suggest there's a script that runs before the job, one after, and the job itself. OOD starts by running the main script influenced by submit.yml
and forks thrice to run the before.sh
, script.sh
, and after.sh
. More about template/
can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/template.html
This is just the html view of the app form. I doubt you need to be editing this.
"},{"location":"admin/ondemand/#manifestyml","title":"manifest.yml","text":"This is where you set the app's name that shows on the dashboard and the app's category. More about manifest.yml
can be found here: https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/manifest.html
If you want to edit, add, or create OOD apps you must be enabled as a dev app developer. In puppet this is done by placing your username under openondemand::dev_app_users:
and puppet will then do the following:
mkdir -p /var/www/ood/apps/dev/<username>\nsudo ln -s ~/ondemand/dev /var/www/ood/apps/dev/<username>/gateway\n
After that, you can git clone
apps to your OOD app developer environment located in ~/ondemand/dev/
. Your dev apps will show in a separate sidebar from the production ood apps and won't be visible by anyone else unless shared."},{"location":"admin/ondemand/#clustersd","title":"clusters.d","text":"/etc/ood/config/clusters.d/
is the config dir where OOD is coupled with a cluster and global scheduler options are specified. For OOD apps to work and be submitted to a cluster this yaml needs to be present and must be named after the cluster's hostname i.e. /etc/ood/config/clusters.d/farm.yml
. This area is controlled by puppet under openondemand::clusters:
. The most relevant section of this file for people not named Teddy is batch_connect:
, and more specifically the script_wrapper:
, which is where you can put shell commands that will always run whenever an OOD app is launched.
batch_connect:\n basic:\n script_wrapper: |\n source /etc/profile\n module purge\n %s\n set_host: host=$(facter fqdn)\n vnc:\n script_wrapper: |\n source /etc/profile\n module purge\n module load conda/websockify turbovnc\n export WEBSOCKIFY_CMD=\"websockify\"\n turbovnc-ood\n %s\n set_host: host=$(facter fqdn)\n
"},{"location":"admin/ondemand/#script-wrapper","title":"Script Wrapper","text":"Under batch_connect:
is the script wrappers listed by the parent app category. Apps like JupyterLab and RStudio are in the basic
category, and VNC has its own category. Anything set in the script_wrapper:
under the app category is always run when an app of that category is run. So if you add a module load openmpi
to the script wrapper under basic:
then that will be run, and openmpi will be loaded, whenever RStudio or JupyterLab is started. The %s
is a placeholder for all the scripts from the aforementioned template/
dir. You can use the placeholder to control whether your commands run before or after your OOD app is started.
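As a minimal sketch (assuming, as in the example above, that you want openmpi available in every basic-category app), the script_wrapper body might contain:
source /etc/profile
module purge
module load openmpi   # hypothetical extra command; runs for every RStudio/JupyterLab session
%s                    # placeholder where the app's template/ scripts are substituted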
The facter fqdn
command within the set_host:
key should resolve to the FQDN of the compute/GPU node the job is running on.
More about clusters.d/
can be found here: https://osc.github.io/ood-documentation/latest/installation/add-cluster-config.html
/etc/ood/config/ood_portal.yml
is the topmost config for OOD. Here be dragons: don't edit this file unless you know what you are doing! Here you can set the server name and port number that OOD will listen on, as well as OOD-related apache configs, certs, proxies, CAS confs, root URI, node URI, logout URI, etc.
Once a user authenticates with OOD, apache then starts the PUN as the user. /etc/ood/config/nginx_stage.yml
determines all the properties of the PUN including global settings for every user's shell env. If you suspect a bug is a user shell env problem, first check the local app env configs set in: submit.yml
in the app's directory. More about nginx_stage.yml can be found here: https://osc.github.io/ood-documentation/latest/reference/files/nginx-stage-yml.html
You can make an announcement to be displayed within a banner on OOD by creating a yml or md file in /etc/ood/config/announcements.d/
. When any user navigates to OOD's dashboard, OOD will check here for the existence of any files.
Here's an example announcement yaml:
type: warning\nmsg: |\n  On Monday, September 24 from 8:00am to 12:00pm there will be a **Maintenance downtime**, which will prevent SSH login to compute nodes and running OnDemand\n
You can also create a test-announcement.yml.erb
to take advantage of ERB ruby templating. More about OOD announcements can be found here: https://osc.github.io/ood-documentation/latest/customizations.html#announcements
You can have the OOD dashboard display the system MOTD by setting these environment variables:
MOTD_PATH=\"/etc/motd\" # this supports both file and RSS feed URIs\nMOTD_FORMAT=\"txt\" # markdown, txt, rss, markdown_erb, txt_erb\n
In /etc/ood/config/apps/dashboard/env
/etc/ood/config/ondemand.d/
is home to nearly all other OOD configs not mentioned here (i.e. ticket submission, nav customizations, branding, etc.). The contents are controlled by puppet, under openondemand::confs:
, and the puppet formatting to properly place yamls here is as follows:
openondemand::confs:\n  <name of yaml (i.e. tickets, if you want to create a tickets.yml)>:\n    data: (denotes the content to put in the yaml)\n      <yaml key>: <yaml value>\n
support_ticket:\n  data:\n    support_ticket:\n      email:\n        from: \"noreply@%{trusted.domain}\"\n        to: hpc-help@ucdavis.edu\n
More about ondemand.d
, openondemand::confs
, and their function and format can be found here: https://osc.github.io/ood-documentation/latest/reference/files/ondemand-d-ymls.html and here: https://forge.puppet.com/modules/osc/openondemand/
sequenceDiagram\n user->>apache: `/var/log/apache2/error.log`\n apache->>CAS: `/var/cache/apache2/mod_auth_cas/`\n CAS->>apache: return\n apache->>pun: `/var/log/apache2/$fqdn_error.log`\n pun->>dashboard: `/var/log/ondemand-nginx/$user/error.log`\n dashboard->>oodapp: `$home/ondemand/data/sys/dashboard/batch_connect/sys/$app/output/$session_id/output.log`\n oodapp->>user: render\n
To start, all users who navigate to the ondemand website first encounter the apache server. Any errors encountered at this step will be in the log(s) at /var/log/apache2/error.log
Apache then redirects the users to CAS for authentication. You can grep -r $user /var/cache/apache2/mod_auth_cas/
to check if users have been authed to CAS and a cookie has been set.
CAS brings us back to apache and here apache runs all sorts of OOD Lua hooks. Any errors encountered at this step will be in the log(s) at /var/log/apache2/$fqdn_error.log
Apache then starts an NginX server as the user and most things like the main dashboard, submitting jobs, running apps, etc happen here in the PUN. Any errors encountered at this step will be in the logs at /var/log/ondemand-nginx/$user/error.log
. You can also see what might be happening here by running commands like ps aux | grep $USER
to see the user's PUN, or ps aux | grep -i nginx
to see all the PUNs. From the OnDemand web UI there's an option to \"Restart Web Server\" which essentially kills and restarts the user's PUN.
The dashboard is mostly covered in section 4, but just wanted to denote that apache then redirects us here after the PUN has been started where users can do everything else. At this step OOD will warn you about things like \"Home Directory Not Found\" and such. If you get this far I'd recommend you troubleshoot issues with users' home dir, NASii, and free space: df | grep $HOME
, du -sh $HOME
, journalctl -u autofs
, and umount stuff. Check that $HOME/ondemand
exists perhaps.
When users start an app like JupyterLab or a VNC desktop, the job is submitted by the user's PUN, and here OOD copies and renders (with ERB) the global app template from /var/www/ood/apps/sys/<app_name>/template/*
to $HOME/ondemand/data/sys/dashboard/batch_connect/sys/<app_name>/(output)/<session_id>
. Any errors encountered at this step will be in $HOME/ondemand/data/sys/dashboard/batch_connect/sys/<app_name>/(output)/<session_id>/*.log
.
Maybe the ondemand server is just in some invalid state and needs to be reset. I'd recommend you check the puppet conf at /etc/puppetlabs/puppet/puppet.conf
, run puppet agent -t
, and maybe restart the machine. Running puppet will force restart the apache server and regenerate OOD from the ood config yamls. Then you can restart the server by either ssh-ing to the server and running reboot
, or by ssh-ing to proxmox and running qm reset <vmid>
as root. TIP: you can find the vmid by finding the server in qm list
.
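A minimal sketch of that reset sequence (the grep pattern used to find the VM is hypothetical):
$ less /etc/puppetlabs/puppet/puppet.conf   # check the puppet conf
$ puppet agent -t                           # re-applies the OOD config yamls and restarts apache
$ reboot                                    # restart from the OOD server itself, or...
$ ssh proxmox1                              # ...from one of the proxmox hosts, as root:
$ qm list | grep ondemand                   # find the vmid
$ qm reset <vmid>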
ondemand.farm.hpc.ucdavis.edu
","text":""},{"location":"admin/ondemand/#dev-doodvmfarmhpcucdavisedu","title":"Dev: dood.vm.farm.hpc.ucdavis.edu
","text":""},{"location":"admin/ondemand/#franklin","title":"Franklin","text":""},{"location":"admin/ondemand/#production-ondemandfranklinhpcucdavisedu","title":"Production: ondemand.franklin.hpc.ucdavis.edu
","text":""},{"location":"admin/ondemand/#hive","title":"Hive","text":""},{"location":"admin/ondemand/#production-ondemandhivehpcucdavisedu","title":"Production: ondemand.hive.hpc.ucdavis.edu
","text":""},{"location":"admin/provisioning/","title":"Provisioning","text":"(cobbler, etc)
"},{"location":"admin/software/","title":"Software Deployment","text":""},{"location":"admin/software/#spack","title":"Spack","text":""},{"location":"admin/software/#conda","title":"Conda","text":""},{"location":"admin/software/#other","title":"Other","text":""},{"location":"admin/vms/","title":"Virtual Machines","text":"HPCCF uses Proxmox for virtualization. Current servers are proxmox1
, proxmox2
, and proxmox3
.
To log in, point your browser to port 8006
on any of the proxmox servers, and choose UCD-CAS
as the realm. You'll need to be on the HPC VLAN to access the interface.
Use Netbox to locate a free IP address, or allocate one in the appropriate cobbler server. See provisioning for more information on selecting an IP/hostname and setting up PXE.
"},{"location":"admin/vms/#general","title":"General","text":"Choose an unused VM ID. Storage areas are pre-created on the DDN, on directory per VM ID. If more need to be created, see the DDN documentation. Populate the \"Name\" field with your chosen VM name.
"},{"location":"admin/vms/#os","title":"OS","text":"If you're installing a machine via PXE from a cobbler server, choose \"Do not use any media.\"
To add a new ISO, copy it to /mnt/pve/DDN-ISOs/template/iso/
on one of the proxmox hosts.
Check the Qemu Agent
box.
Defaults are fine. Adjust disk size as needed.
"},{"location":"admin/vms/#cpu","title":"CPU","text":"Use type x86-64-v3
. Adjust cores to taste.
Recent Ubuntu installer will fail unless you use at least 4096.
"},{"location":"admin/vms/#network","title":"Network","text":"See Netbox for a list of vlans.
Make sure to select VirtIO (paravirtualized)
for the network type.
Do not forget to add to DNS.
If this is a production VM, add the \"production\" tag.
"},{"location":"farm/","title":"Farm","text":"Farm is a Linux-based supercomputing cluster for the College of Agricultural and Environmental Sciences at UC Davis. Designed for both research and teaching, it is a significant campus resource primarily for CPU and RAM-based computing, with a wide selection of centrally-managed software available for research in genetics, proteomics, and related bioinformatics pipelines, weather and environmental modeling, fluid and particle simulations, geographic information system (GIS) software, and more.
For buying in resources in Farm cluster, contact CEAS IT director Adam Getchell - acgetchell@ucdavis.edu
"},{"location":"farm/#farm-hardware","title":"Farm Hardware","text":"Farm is an evolving cluster that changes and grows to meet the current needs of researchers, and has undergone three phases, with Farm III as the most recent evolution.
Farm III consists of 32 parallel nodes with up to 64 CPUs and 256GB RAM each in low2/med2/high2, plus 17 \u201cbigmem\u201d nodes with up to 96 CPUs and 1TB RAM each in the bml/bmm/bmh queue. All Farm III bigmem and newer parallel nodes and storage are on EDR/100Gbit interconnects. Older parallel nodes and storage are on FDR/55Gbit.
Farm II consists of 95 parallel nodes with 24 CPUs and 64GB RAM each in low/med/high, plus 9 \u201cbigmem\u201d nodes with 64 CPUs and 512GB RAM each in the bigmeml/bigmemm/bigmemh queues, and 1 additional node with 96 CPUs and 1TB RAM. Farm II nodes are on QDR/32Gbit interconnects.
Hardware from both Farm II and Farm III are still in service; Farm I has been decommissioned as of 2014.
Farm also has multiple file servers with over 5.3PB of storage space in total.
"},{"location":"farm/#access-to-farm","title":"Access to Farm","text":"All researchers in CA&ES are entitled to free access to:
8 nodes with 24 CPUs and 64GB RAM each (up to a maximum of 192 CPUs and 512GB RAM) in Farm II\u2019s low, medium, and high-priority batch queues,
4 nodes with 352 CPUs and 768GB RAM each in Farm III's low2, med2, and high2-priority batch queues.
The bml (bigmem, low priority/requeue) partition, which has 24 nodes with a combined 60 TB of RAM.
In addition to this, each new user is allocated a 20GB home directory. If you want to use the CA&ES free tier, select \u201cCA&ES free tier\" from the list of sponsors here.
Additional usage and access may be purchased by contributing to Farm III by through the node and/or storage rates or by purchasing equipment and contributing through the rack fee rate.
Contributors always receive priority access to the resources that they have purchased within one minute with the \u201cone-minute guarantee.\u201d Users can also request additional unused resources on a \u201cfair share\u201d basis in the medium or low partitions.
"},{"location":"farm/#farm-administration","title":"Farm Administration","text":"Farm hardware and software are administrated by the HPC Core Facility Team.
"},{"location":"farm/#current-rates","title":"Current Rates","text":"As of October 2023, the rates for Farm III:
Node and Storage Rates (each buy-in guarantees access for 5 years): -
Equipment may be purchased directly by researchers based on actual cost. Equipment quote available upon request.
Sponsor - CAES Information about Partitions - what is low, med and high and what are available GPUs on Farm?
Free tier access to 20GB capped storage.
Low partition - Internittent access to idle resources abbove limit
Medium Partition - Shared use of idle resources above permitted limit
High Partition - Dedicated use of invested resource.
CPU threads - 15,384
GPU count - 29
Aggregated RAM - 66 TB
Maximum RAM per node - 2TB
Node Count - 202
Inter-Connect - 200Gbps
Total number of Users - 726/328
"},{"location":"franklin/","title":"Franklin","text":"Franklin is a high performance computing (HPC) cluster for the College of Biological Sciences at UC Davis. Its primary use is for research in genetics, genomics, and proteomics, structural biology via cryogenic electron microscopy, computational neuroscience, and generally, the computational biology workflows related to those fields. Franklin currently consists of 7 AMD CPU nodes each with 128 physical and 256 logical cores and 1TB of RAM, 9 GPU nodes with a total of 72 Nvidia RTX A4000, RTX A5000, RTX 6000 Ada, and RTX 2080 TI GPUs, and a collection of ZFS file servers providing approximately 3PB of storage.
"},{"location":"franklin/scheduling/","title":"Job Scheduling","text":""},{"location":"franklin/scheduling/#partitions","title":"Partitions","text":""},{"location":"franklin/scheduling/#quality-of-service","title":"Quality of Service","text":""},{"location":"franklin/storage/","title":"Storage","text":""},{"location":"franklin/storage/#home-directories","title":"Home Directories","text":"All users are allocated 20GB of storage for their home directory. This space is free and not associated with lab storage quotas.
"},{"location":"franklin/storage/#lab-storage-allocations","title":"Lab Storage Allocations","text":"Research data should be stored on lab storage allocations. These allocations are mounted at /group/[PI_GROUP_NAME]
. N ote that these directories are mounted as-needed, so your particular allocation might not show up when you run ls /group
; you will need to access the path directly. You can find your PI group name by running groups
: this will output your user name and a name ending in grp
. The latter corresponds to the directory name under /group
, unless otherwise requested by your PI.
Franklin has a deployment of deepmind's alphafold as well as its databases compiled from source1. The databases are located at /share/databases/alphafold
; this directory is exported as $ALPHAFOLD_DB_ROOT
when the module is loaded.
This is not a docker deployment. As such, many of the examples provided online need to be slightly modified. The main script supplied by the alphafold package is run_alphafold.py
, which is the script that the docker container calls internally. All the same command line arguments that can be passed to alphafold's run_docker.py
script can be passed to run_alphafold.py
, but the latter requires all the database locations be supplied:
run_alphafold.py \\\n--data_dir=\"$ALPHAFOLD_DB_ROOT\" \\\n--uniref90_database_path=$ALPHAFOLD_DB_ROOT/uniref90/uniref90.fasta \\\n--mgnify_database_path=$ALPHAFOLD_DB_ROOT/mgnify/mgy_clusters_2022_05.fa \\\n--template_mmcif_dir=$ALPHAFOLD_DB_ROOT/pdb_mmcif/mmcif_files \\\n--obsolete_pdbs_path=$ALPHAFOLD_DB_ROOT/pdb_mmcif/obsolete.dat \\\n--bfd_database_path=$ALPHAFOLD_DB_ROOT/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \\\n--uniref30_database_path=$ALPHAFOLD_DB_ROOT/uniref30/UniRef30_2021_03 \\\n--pdb70_database_path=$ALPHAFOLD_DB_ROOT/pdb70/pdb70 \\\n[OTHER ARGS]\n
Because this is annoying, we have supplied a wrapper script named alphafold-wrapped
with our module that passes these common options for you. Any of the arguments not passed above will be passed along to the run_alphafold.py
script; for example:
alphafold-wrapped \\\n--output_dir=[OUTPUTS] \\\n--fasta_paths=[FASTA INPUTS] \\\n--max_template_date=2021-11-01 \\\n--use_gpu_relax=true\n
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., \u2026 Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Springer Science and Business Media LLC. https://doi.org/10.1038/s41586-021-03819-2 \u21a9
Franklin has multiple CPU and GPU optimized versions of the Relion cryo-EM structural determination package. The head node has been configured to support X11 forwarding, enabling the Relion GUI to be launched. Relion jobs are submitted for batch processing on the cluster node via Slurm. Each Relion module exports the necessary configurations to pre-fill job submission and dependency information in the GUI, and we have defined additional GUI fields to further configure Slurm parameters. We are also maintaining an additional software package, relion-helper
, to assist users in switching between Relion modules within the same project.
Your first step is deciding which Relion variant you should use. We recommend version 4.0.0, as it is the currently-supported stable release. There are three variants of this version: relion/cpu/4.0.0+amd
, relion/gpu/4.0.0+amd
, and relion/gpu/4.0.0+intel
, which correspond to the CPU optimized, GPU with AMD CPU optimized, and GPU with Intel CPU optimized builds, respectively. More information about these modules is available in the Module Variants section. In general, unless you have access to the three GPU nodes owned by the Al-Bassam lab, you can ignore the Intel variants, and use the CPU +amd
version for multi-node CPU only jobs and the GPU +amd
version if you have access to a GPU node.
If you are completely unfamiliar with Relion, you should start with the tutorial.
Note
Because Relion is GUI driven, you need to ssh
to Franklin with X11 forwarding enabled. Instructions for enabling X11 forwarding can be found in the Access section.
Make sure you have loaded one of the Relion modules:
$ module list relion\n\nCurrently Loaded Modules Matching: relion\n 1) relion/gpu/4.0.0+amd 2) relion-helper/0.2\n
Change your working directory your Relion project directory and type relion
. The Relion GUI should then pop up locally. There will be a bit of latency when using it, especially if you are off campus. You may be able to reduce latency by enabling SSH compression.
The relion start screen."},{"location":"franklin/software/cryoem/#dependency-configuration","title":"Dependency Configuration","text":"
The paths to software that different Relion jobs use will be automatically filled in. Editing these paths, unless you really, really know what you are doing, is not recommended and will likely result in problems, as some of these dependencies are compiled with architecture-specific flags that match their Relion variant.
Pre-filled dependent program path.
Danger
If you plan to switch between Relion modules within the same project, you must use the procedure described in the relion-helper section. Failure to do so will result in execution errors.
"},{"location":"franklin/software/cryoem/#slurm-configuration","title":"Slurm Configuration","text":"Our Relion deployment has additional fields in the Running tabs. These new fields are:
--mail-user
sbatch
/srun
parameter.--memory-per-cpu
parameter. Total RAM use of a job will be (Number of MPI procs) * (Number of Threads) * (Memory per CPU), when the Number of Threads field is available; otherwise it will be (Number of MPI procs) * (Memory per CPU).--time
parameter.TYPE:NUM
is supplied (example: a4000:4
), specific models of GPU will be requested. See the Resources section for more information on available GPU types. The relion/cpu
modules lack the GPU resources field. Note the submission script as well.
The relion/gpu
module has an extra field for GPU resources. Also note the differing submission script.
The default GUI fields serve their original purposes:
--ntasks
parameter. These tasks may be distributed across multiple nodes, depending on the number of Threads requested. For GPU runs, this should be the number of GPUs + 1.--cpus-per-task
parameter, which means it is the number of threads per MPI proc. Some job types do not expose this field, as they can only be run with a single-thread per MPI proc.--partition
parameter. More information on partitions can be found in the Queueing section./share/apps/spack/templates/hpccf/franklin
or in our spack GitHub repo.Sometimes, you may wish to use different Relion modules for different tasks while working within the same project -- perhaps you'd prefer to use the CPU-optimized version for CTF estimation and the GPU-optimized version for 3D refinement. This does not work out of the box. Relion fills the filesystem paths of its dependencies and templates from environment variables, and those environment variables are set in the modulefiles of the differing Relion builds. However, when a Relion job is run, those paths are cached in hidden .star
files in the project directory, and the next time Relion is run, it fills those paths from the cache files instead of the environment variables. This means that, after switching modules, the cached location of the previous module will be used, instead of the exported environment variables from the new module. This causes major breakage due to dependencies having different compilation options to match the parent Relion they are attached to and Slurm templates having different configuration options available.
Luckily, we have a solution! We wrote and are maintaining relion-helper, a simple utility that updates the cached variables in a project to match whatever Relion module is currently loaded. Let's go over example use of the tool.
In this example, assume we have a relion project directory at /path/to/my/project
. We ran some steps with the module relion/gpu/4.0.0+amd
, and now want to switch to relion/cpu/4.0.0+amd
. First, let's swap modules:
$ module unload relion/gpu/4.0.0+amd \namdfftw/3.2+amd: unloaded.\nctffind/4.1.14+amd: unloaded.\nrelion/gpu/4.0.0+amd: unloaded.\nmotioncor2/1.5.0: unloaded.\ngctf/1.06: unloaded.\nghostscript/9.56.1: unloaded.\n\n$ module load relion/cpu/4.0.0+amd.lua \namdfftw/3.2+amd: loaded.\nctffind/4.1.14+amd: loaded.\nrelion/cpu/4.0.0+amd: loaded.\nmotioncor2/1.5.0: loaded.\ngctf/1.06: loaded.\nghostscript/9.56.1: loaded.\n
And load relion-helper:
$ module load relion-helper \nrelion-helper/0.2: loaded.\n\n$ relion-helper -h\nusage: relion-helper [-h] {reset-cache} ...\n\npositional arguments:\n {reset-cache}\n\noptions:\n -h, --help show this help message and exit\n
Now, change to the project directory:
$ cd /path/to/my/project\n
Then, run the utility. It will pull the updated values from the appropriate environment variables that were exported by the new module and write them to the cache files in-place.
$ relion-helper reset-cache\n> .gui_ctffindjob.star:41:\n qsub_extra2: 2 => 10000\n> .gui_ctffindjob.star:42:\n qsub_extra3: 10000 => 12:00:00\n> .gui_ctffindjob.star:43:\n qsubscript: /share/apps/spack/templates/hpccf/franklin/relion.4.0.0.gpu.zen2.slurm.template.sh => \n/share/apps/spack/templates/hpccf/franklin/relion.4.0.0.cpu.slurm.template.sh\n> .gui_class2djob.star:53:\n qsub_extra2: 2 => 10000\n> .gui_class2djob.star:54:\n qsub_extra3: 10000 => 12:00:00\n> .gui_class2djob.star:55:\n qsubscript: /share/apps/spack/templates/hpccf/franklin/relion.4.0.0.gpu.zen2.slurm.template.sh => \n/share/apps/spack/templates/hpccf/franklin/relion.4.0.0.cpu.slurm.template.sh\n> .gui_autopickjob.star:63:\n qsub_extra2: 2 => 10000\n> .gui_autopickjob.star:64:\n qsub_extra3: 10000 => 12:00:00\n> .gui_autopickjob.star:65:\n qsubscript: /share/apps/spack/templates/hpccf/franklin/relion.4.0.0.gpu.zen2.slurm.template.sh => \n/share/apps/spack/templates/hpccf/franklin/relion.4.0.0.cpu.slurm.template.sh\n> .gui_importjob.star:38:\n qsub_extra2: 2 => 10000\n...\n
The above output is truncated for brevity. For each cached variable it updates, it reports the name of the cache file, the line number of the change, and the variable name and value of the change. You can now launch Relion and continue with your work.
Each time you want to switch Relion modules for a project, you will need to run this after loading the new module.
For now, relion-helper only has the reset-cache
subcommand. You can skip cd
ing to the project directory by passing the project directory to it instead:
$ relion-helper reset-cache -p /path/to/my/project\n
Although the changes are made in-place, it leaves backups of the modified files, in case you are concerned about bugs. The original files are of the form .gui_[JOBNAME].star
, and the backups are suffixed with .bak
:
$ ls -al /path/to/my/project\ntotal 317\ndrwxrwxr-x 10 camw camw 31 Feb 3 10:02 .\ndrwxrwxr-x 4 camw camw 6 Jan 12 12:58 ..\ndrwxrwxr-x 5 camw camw 5 Jan 12 12:46 .Nodes\ndrwxrwxr-x 2 camw camw 2 Jan 12 12:40 .TMP_runfiles\n-rw-rw-r-- 1 camw camw 1959 Feb 3 10:02 .gui_autopickjob.star\n-rw-rw-r-- 1 camw camw 1957 Feb 3 10:01 .gui_autopickjob.star.bak\n-rw-rw-r-- 1 camw camw 1427 Feb 3 10:02 .gui_class2djob.star\n-rw-rw-r-- 1 camw camw 1425 Feb 3 10:01 .gui_class2djob.star.bak\n-rw-rw-r-- 1 camw camw 1430 Feb 3 10:02 .gui_ctffindjob.star\n-rw-rw-r-- 1 camw camw 1428 Feb 3 10:01 .gui_ctffindjob.star.bak\n...\n
Warning
We do not recommend changing between major Relion versions within the same project: ie, from 3.0.1 to 4.0.0.
"},{"location":"franklin/software/cryoem/#module-variants","title":"Module Variants","text":"There are currently six variations of Relion available on Franklin. Versions 3.1.3 and 4.0.0 are available, each with:
relion/cpu/[VERSION]+amd
relion/gpu/[VERSION]+amd
relion/gpu/[VERSION]+intel
The CPU-optimized builds were configured with -DALTCPU=True
and without CUDA support. For Relion CPU jobs, they will be much faster than the GPU variants. The AMD-optimized +amd
variants were compiled with -DAMDFFTW=ON
and linked against the amdfftw
implementation of FFTW
, in addition to having Zen 2 microarchitecture flags specified to GCC. The +intel
variants were compiled with AVX2 support and configured with the -DMKLFFT=True
flag, so they use the Intel OneAPI MKL implementation of FFTW
. All the GPU variants are targeted to a CUDA compute version of 7.5. The full Cryo-EM software stack is defined in the HPCCF spack configuration repository, and we maintain our own Relion spack package definition. More information on the configurations described here can be found in the Relion docs.
The different modules may need to be used with different Slurm resource directives, depending on their variants. The necessary directives, given a module and job partition, are as follows:
Module Name Slurm Partition Slurm Directivesrelion/cpu/[3.1.3,4.0.0]+amd
low
--constraint=amd
relion/cpu/[3.1.3,4.0.0]+amd
high
N/A relion/gpu/[3.1.3,4.0.0]+amd
low
--constraint=amd --gres=gpu:[$N_GPUs]
or --gres=gpu:[a4000,a5000]:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+amd
jalettsgrp-gpu
--gres=gpu:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+amd
mmgdept-gpu
--gres=gpu:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+intel
low
--constraint=intel --gres=gpu:[$N_GPUs]
or --gres=gpu:[rtx_2080_ti]:[$N_GPUs]
relion/gpu/[3.1.3,4.0.0]+intel
jawdatgrp-gpu
--gres=gpu:[$N_GPUs]
For example, to use the CPU-optimized Relion module relion/cpu/4.0.0+amd
on the free, preemptable low
partition, you should submit jobs with --constraint=amd
so as to eliminate the Intel nodes in that partition from consideration. However, if you have access to and are using the high
partition with the same module, no additional Slurm directives are required, as the high
partition only has CPU compute nodes. Alternatively, if you were using an AMD-optimized GPU version, like relion/gpu/4.0.0+amd
, and wished to use 2 GPUs on the low
partition, you would need to provide both the --constraint=amd
and a --gres=gpu:2
directive, in order to get an AMD node on the partition along with the required GPUs. Those with access to and submitting to the mmgdept-gpu
queue would need only to specify --gres=gpu:2
, as that partition only has AMD nodes in it.
Note
If you are submitting jobs via the GUI, these Slurm directives will already be taken care of for you. If you wish to submit jobs manually, you can get the path to Slurm submission template for the currently-loaded module from the $RELION_QSUB_TEMPLATE
environment variable; copying this template is a good starting place for building your batch scripts.
Our installation of CTFFIND4 has +amd
and +intel
variants which, like Relion, are linked against amdfftw
and Intel OneAPI MKL, respectively. The Slurm --constraint
flags should be used with these as well, when appropriate, as indicated in the Relion directive table. Each Relion module has its companion CTFFIND4 module as a dependency, so the appropriate version will automatically be loaded when you load Relion, and the proper environment variables are set for the Relion GUI to point at them.
We have deployed MotionCor2 binaries which have been patched to link against the appropriate version of CUDA. These are targetted at a generic architecture, as the source code is not available. Like CTFFIND4, this module is brought in by Relion and the proper environment variables set for Relion to use it.
"},{"location":"franklin/software/cryoem/#gctf","title":"Gctf","text":"Gctf binaries have been patched and deployed in the same manner as MotionCor2.
"},{"location":"franklin/software/modules/","title":"Modules","text":"Franklin currently uses lmod
, which is cross-compatible with envmod
. See our lmod
docuementation for more information on using the module system.
Many modules correspond to different versions of the same software, and some software has multiple variants of the same version. The default naming convention is NAME/VERSION
: for example, cuda/11.8.0
or mcl/14-137
. The version can be omitted when loading, in which case the highest-versioned module or the version marked as default (with a (D)
) will be used.
Some module names are structured as NAME/VARIANT/VERSION
. For these, the minimum name you can use for loading is NAME/VARIANT
: for example, you can load relion/gpu
or relion/cpu
, but just trying to module load relion
will fail.
Software is sometimes compiled with optimizations specific to certain hardware. These are named with the format NAME/VERSION+ARCH
or NAME/VARIANT/VERSION+arch
. For example, ctffind/4.1.14+amd
was compiled with AMD Zen2-specific optimizations and uses the amdfftw
implementation of the FFTW
library, and will fail on the Intel-based RTX2080 nodes purchased by the Al-Bassam lab (gpu-9-[10,18,26]
). Conversely, ctffind/4.1.14+intel
was compiled with Intel-specific compiler optimizations as well as linking against the Intel OneAPI MKL implementation of FFTW
, and is only meant to be used on those nodes. In all cases, the +amd
variant of a module, if it exists, is the default, as the majority of the nodes use AMD CPUs.
Software without a +ARCH
was compiled for a generic architecture and will function on all nodes. The generic architecture on Franklin is x86-64-v3
, which means they support AVX
, AVX2
, and all other previous SSE
and other vectorized instructions.
The various conda modules have their own naming scheme. These are of the form conda/ENVIRONMENT/VERSION
. The conda/base/VERSION
module(s) load the base conda environment and set the appropriate variables to use the conda activate
and deactivate
commands, while the the modules for the other environments first load conda/base
and then activate the environment to which they correspond. The the conda
section for more information on conda
and Python on Franklin.
These modules are built and managed by our Spack deployment. Most were compiled for generic architecture, meaning they can run on any node, but some are Intel or AMD specific, and some require GPU support.
"},{"location":"franklin/software/modules/#r","title":"R","text":"R is 'GNU S', a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information.
Versions: 4.1.1
Arches: generic
Modules: R/4.1.1
ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size.
Versions: 2.3.1
Arches: generic
Modules: abyss/2.3.1
FFTW (AMD Optimized version) is a comprehensive collection of fast C routines for computing the Discrete Fourier Transform (DFT) and various special cases thereof. It is an open-source implementation of the Fast Fourier transform algorithm. It can compute transforms of real and complex-values arrays of arbitrary size and dimension. AMD Optimized FFTW is the optimized FFTW implementation targeted for AMD CPUs. For single precision build, please use precision value as float. Example : spack install amdfftw precision=float
Versions: 3.2
Arches: amd
Modules: amdfftw/3.2+amd
Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other
Versions: 1.10.7
Arches: generic
Modules: ant/1.10.7
ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences.
Versions: 1.2.38
Arches: generic
Modules: aragorn/1.2.38
Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome.
Versions: 2.30.0
Arches: generic
Modules: bedtools2/2.30.0
Basic Local Alignment Search Tool.
Versions: 2.12.0
Arches: generic
Modules: blast-plus/2.12.0
Blast2GO is a bioinformatics platform for high-quality functional annotation and analysis of genomic datasets.
Versions: 5.2.5
Arches: generic
Modules: blast2go/5.2.5
BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm.
Versions: 35
Arches: generic
Modules: blat/35
Bowtie is an ultrafast, memory-efficient short read aligner for short DNA sequences (reads) from next-gen sequencers.
Versions: 1.3.0
Arches: generic
Modules: bowtie/1.3.0
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences
Versions: 2.4.2
Arches: generic
Modules: bowtie2/2.4.2
Burrow-Wheeler Aligner for pairwise alignment between DNA sequences.
Versions: 0.7.17
Arches: generic
Modules: bwa/0.7.17
bwtool is a command-line utility for bigWig files.
Versions: 1.0
Arches: generic
Modules: bwtool/1.0
A single molecule sequence assembler for genomes large and small.
Versions: 2.2
Arches: generic
Modules: canu/2.2
CAP3 is DNA Sequence Assembly Program
Versions: 2015-02-11
Arches: generic
Modules: cap3/2015-02-11
Clustal Omega: the last alignment program you'll ever need.
Versions: 1.2.4
Arches: generic
Modules: clustal-omega/1.2.4
Multiple alignment of nucleic acid and protein sequences.
Versions: 2.1
Arches: generic
Modules: clustalw/2.1
Corset is a command-line software program to go from a de novo transcriptome assembly to gene-level counts.
Versions: 1.09
Arches: generic
Modules: corset/1.09
Fast and accurate defocus estimation from electron micrographs.
Versions: 4.1.14
Arches: intel, amd
Modules: ctffind/4.1.14+intel
, ctffind/4.1.14+amd
CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). Note: This package does not currently install the drivers necessary to run CUDA. These will need to be installed manually. See: https://docs.nvidia.com/cuda/ for details.
Versions: 11.7.1, 11.8.0
Arches: generic
Modules: cuda/11.8.0
, cuda/11.7.1
Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
Versions: 2.2.1
Arches: generic
Modules: cufflinks/2.2.1
Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc. Primarily written to support an Illumina based pipeline - but should work with any FASTQs.
Versions: 2021-10-20
Arches: generic
Modules: ea-utils/2021-10-20
EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community
Versions: 6.6.0
Arches: generic
Modules: emboss/6.6.0
Pairwise sequence alignment of DNA and proteins
Versions: 2.4.0
Arches: generic
Modules: exonerate/2.4.0
This is an exonerate fork with added gff3 support. Original website with user guides: http://www.ebi.ac.uk/~guy/exonerate/
Versions: 2.3.0
Arches: generic
Modules: exonerate-gff3/2.3.0
A quality control tool for high throughput sequence data.
Versions: 0.11.9
Arches: generic
Modules: fastqc/0.11.9
FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST). We believe that FFTW, which is free software, should become the FFT library of choice for most applications.
Versions: 3.3.10
Arches: generic
Modules: fftw/3.3.10
Bayesian haplotype-based genetic polymorphism discovery and genotyping.
Versions: 1.3.6
Arches: generic
Modules: freebayes/1.3.6
Genome Analysis Toolkit Variant Discovery in High-Throughput Sequencing Data
Versions: 4.2.6.1, 3.8.1
Arches: generic
Modules: gatk/3.8.1
, gatk/4.2.6.1
The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Ada, and Go, as well as libraries for these languages.
Versions: 5.5.0, 4.9.4, 7.5.0
Arches: generic
Modules: gcc/5.5.0
, gcc/7.5.0
, gcc/4.9.4
a GPU accelerated program for Real-Time CTF determination, refinement, evaluation and correction.
Versions: 1.06
Arches: generic
Modules: gctf/1.06
Genrich is a peak-caller for genomic enrichment assays.
Versions: 0.6
Arches: generic
Modules: genrich/0.6
An interpreter for the PostScript language and for PDF.
Versions: 9.56.1
Arches: generic
Modules: ghostscript/9.56.1
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses.
Versions: 3.02b
Arches: generic
Modules: glimmer/3.02b
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
Versions: 1.12.2
Arches: generic
Modules: hdf5/1.12.2
HISAT2 is a fast and sensitive alignment program for mapping next- generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against the general human population (as well as against a single reference genome).
Versions: 2.2.0
Arches: generic
Modules: hisat2/2.2.0
HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
Versions: 3.3.2
Arches: generic
Modules: hmmer/3.3.2
Software for motif discovery and next generation sequencing analysis
Versions: 4.9.1
Arches: generic
Modules: homer/4.9.1
The Hardware Locality (hwloc) software project. The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs. It primarily aims at helping applications with gathering information about modern computing hardware so as to exploit it accordingly and efficiently.
Versions: 2.8.0
Arches: generic
Modules: hwloc/2.8.0
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
Versions: 2.12.3
Arches: generic
Modules: igv/2.12.3
Infernal (INFERence of RNA ALignment) is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs).
Versions: 1.1.4
Arches: generic
Modules: infernal/1.1.4
Intel oneAPI Compilers. Includes: icc, icpc, ifort, icx, icpx, ifx, and dpcpp. LICENSE INFORMATION: By downloading and using this software, you agree to the terms and conditions of the software license agreements at https://intel.ly/393CijO.
Versions: 2022.2.1
Arches: generic
Modules: intel-oneapi-compilers/2022.2.1
Intel oneAPI Math Kernel Library (Intel oneMKL; formerly Intel Math Kernel Library or Intel MKL), is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math. LICENSE INFORMATION: By downloading and using this software, you agree to the terms and conditions of the software license agreements at https://intel.ly/393CijO.
Versions: 2022.2.1
Arches: generic
Modules: intel-oneapi-mkl/2022.2.1
InterProScan is the software package that allows sequences (protein and nucleic) to be scanned against InterPro's signatures. Signatures are predictive models, provided by several different databases, that make up the InterPro consortium.
Versions: 5.56-89.0
Arches: generic
Modules: interproscan/5.56-89.0
IQ-TREE Efficient software for phylogenomic inference
Versions: 2.1.3
Arches: generic
Modules: iq-tree/2.1.3
Efficient and versatile phylogenomic software by maximum likelihood
Versions: 2.1.2
Arches: generic
Modules: iqtree2/2.1.2
The Java Development Kit (JDK) released by Oracle Corporation in the form of a binary product aimed at Java developers. Includes a complete JRE plus tools for developing, debugging, and monitoring Java applications.
Versions: 17.0.1
Arches: generic
Modules: jdk/17.0.1
JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA.
Versions: 1.1.11
Arches: generic
Modules: jellyfish/1.1.11
A fast multiple sequence alignment program for biological sequences.
Versions: 3.3.1
Arches: generic
Modules: kalign/3.3.1
kallisto is a program for quantifying abundances of transcripts from RNA-Seq data.
Versions: 0.48.0
Arches: generic
Modules: kallisto/0.48.0
KmerGenie estimates the best k-mer length for genome de novo assembly.
Versions: 1.7044
Arches: generic
Modules: kmergenie/1.7044
Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
Versions: 1.0
Arches: generic
Modules: kraken/1.0
Kraken2 is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies.
Versions: 2.1.2
Arches: generic
Modules: kraken2/2.1.2
Metagenomics classifier with unique k-mer counting for more specific results.
Versions: 0.7.3
Arches: generic
Modules: krakenuniq/0.7.3
LAST finds similar regions between sequences, and aligns them. It is designed for comparing large datasets to each other (e.g. vertebrate genomes and/or large numbers of DNA reads).
Versions: 1282
Arches: generic
Modules: last/1282
The libevent API provides a mechanism to execute a callback function when a specific event occurs on a file descriptor or after a timeout has been reached. Furthermore, libevent also support callbacks due to signals or regular timeouts.
Versions: 2.1.12
Arches: generic
Modules: libevent/2.1.12
Fast genome and metagenome distance estimation using MinHash.
Versions: 2.3
Arches: generic
Modules: mash/2.3
MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches.
Versions: 4.0.9
Arches: generic
Modules: masurca/4.0.9
The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs (also known as networks) based on simulation of (stochastic) flow in graphs.
Versions: 14-137
Arches: generic
Modules: mcl/14-137
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
Versions: 1.1.4
Arches: generic
Modules: megahit/1.1.4
The MEME Suite allows the biologist to discover novel motifs in collections of unaligned nucleotide or protein sequences, and to perform a wide variety of other motif-based analyses.
Versions: 5.3.0
Arches: generic
Modules: meme/5.3.0
MetaEuk is a modular toolkit designed for large-scale gene discovery and annotation in eukaryotic metagenomic contigs.
Versions: 6-a5d39d9
Arches: generic
Modules: metaeuk/6-a5d39d9
MinCED is a program to find Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) in full genomes or environmental datasets such as metagenomes, in which sequence size can be anywhere from 100 to 800 bp.
Versions: 0.3.2
Arches: generic
Modules: minced/0.3.2
Miniasm is a very fast OLC-based de novo assembler for noisy long reads.
Versions: 2018-3-30
Arches: generic
Modules: miniasm/2018-3-30
Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Mappy provides a convenient interface to minimap2.
Versions: 2.14
Arches: generic
Modules: minimap2/2.14
miRDeep2 is a completely overhauled tool which discovers microRNA genes by analyzing sequenced RNAs.
Versions: 0.0.8
Arches: generic
Modules: mirdeep2/0.0.8
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets
Versions: 14-7e284
Arches: generic
Modules: mmseqs2/14-7e284
This project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.
Versions: 1.48.0
Arches: generic
Modules: mothur/1.48.0
MotionCor2 is a multi-GPU program that corrects beam-induced sample motion recorded on dose fractionated movie stacks. It implements a robust iterative alignment algorithm that delivers precise measurement and correction of both global and non-uniform local motions at single pixel level, suitable for both single-particle and tomographic images. MotionCor2 is sufficiently fast to keep up with automated data collection.
Versions: 1.5.0
Arches: generic
Modules: motioncor2/1.5.0
MUMmer is a system for rapidly aligning entire genomes.
Versions: 3.23
Arches: generic
Modules: mummer/3.23
MUMmer is a versatile alignment tool for DNA and protein sequences.
Versions: 4.0.0rc1
Arches: generic
Modules: mummer4/4.0.0rc1
MUSCLE is one of the best-performing multiple alignment programs according to published benchmark tests, with accuracy and speed that are consistently better than CLUSTALW.
Versions: 3.8.1551
Arches: generic
Modules: muscle/3.8.1551
RMBlast search engine for NCBI
Versions: 2.11.0
Arches: generic
Modules: ncbi-rmblastn/2.11.0
NCBI C++ Toolkit
Versions: 26_0_1
Arches: generic
Modules: ncbi-toolkit/26_0_1
The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives. This package contains the interface to the VDB.
Versions: 3.0.0
Arches: generic
Modules: ncbi-vdb/3.0.0
Data-driven computational pipelines.
Versions: 22.10.1
Arches: generic
Modules: nextflow/22.10.1
The free and open-source Java implementation
Versions: 11.0.17_8, 16.0.2
Arches: generic
Modules: openjdk/11.0.17_8
, openjdk/16.0.2
OpenLDAP Software is an open source implementation of the Lightweight Directory Access Protocol. The suite includes: slapd - stand-alone LDAP daemon (server) libraries implementing the LDAP protocol, and utilities, tools, and sample clients.
Versions: 2.4.49
Arches: generic
Modules: openldap/2.4.49
An open source Message Passing Interface implementation. The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.
Versions: 4.1.4
Arches: generic
Modules: openmpi/4.1.4
OrthoFinder is a fast, accurate and comprehensive analysis tool for comparative genomics. It finds orthologues and orthogroups, infers rooted gene trees for all orthogroups, and infers a rooted species tree for the species being analysed. OrthoFinder also provides comprehensive statistics for comparative genomic analyses. OrthoFinder is simple to use and all you need to run it is a set of protein sequence files (one per species) in FASTA format.
Versions: 2.5.4
Arches: generic
Modules: orthofinder/2.5.4
OrthoMCL is a genome-scale algorithm for grouping orthologous protein sequences.
Versions: 2.0.9
Arches: generic
Modules: orthomcl/2.0.9
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input.
Versions: 20220522
Arches: generic
Modules: parallel/20220522
PatchELF is a small utility to modify the dynamic linker and RPATH of ELF executables.
Versions: 0.16.1
Arches: generic
Modules: patchelf/0.16.1
PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees).
Versions: 3.697
Arches: generic
Modules: phylip/3.697
Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
Versions: 2.26.2
Arches: generic
Modules: picard/2.26.2
Pilon is an automated genome assembly improvement and variant detection tool.
Versions: 1.22
Arches: generic
Modules: pilon/1.22
PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
Versions: 1.07
Arches: generic
Modules: plink/1.07
The Process Management Interface (PMI) has been used for quite some time as a means of exchanging wireup information needed for interprocess communication. However, meeting the significant orchestration challenges presented by exascale systems requires that the process-to-system interface evolve to permit a tighter integration between the different components of the parallel application and existing and future SMS solutions. PMI Exascale (PMIx) addresses these needs by providing an extended version of the PMI definitions specifically designed to support exascale and beyond environments by: (a) adding flexibility to the functionality expressed in the existing APIs, (b) augmenting the interfaces with new APIs that provide extended capabilities, (c) forging a collaboration between subsystem providers including resource manager, fabric, file system, and programming library developers, (d) establishing a standards-like body for maintaining the definitions, and (e) providing a reference implementation of the PMIx standard that demonstrates the desired level of scalability while maintaining strict separation between it and the standard itself.
Versions: 4.1.2
Arches: generic
Modules: pmix/4.1.2
Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.
Versions: 1.14.6
Arches: generic
Modules: prokka/1.14.6
R is 'GNU S', a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information.
Versions: 4.2.0
Arches: generic
Modules: r/4.2.0
RAxML-NG is a phylogenetic tree inference tool which uses the maximum-likelihood (ML) optimality criterion. Its search heuristic is based on iteratively performing a series of Subtree Pruning and Regrafting (SPR) moves, which allows it to quickly navigate to the best-known ML tree. RAxML-NG is a successor of RAxML (Stamatakis 2014) and leverages the highly optimized likelihood computation implemented in libpll (Flouri et al. 2014).
Versions: 1.0.2
Arches: generic
Modules: raxml-ng/1.0.2
Parallel genome assemblies for parallel DNA sequencing
Versions: 2.3.1
Arches: generic
Modules: ray/2.3.1
Rclone is a command line program to sync files and directories to and from various cloud storage providers
Versions: 1.59.1
Arches: generic
Modules: rclone/1.59.1
RECON: a package for automated de novo identification of repeat families from genomic sequences.
Versions: 1.05
Arches: generic
Modules: recon/1.05
RELION (for REgularised LIkelihood OptimisatioN, pronounce rely-on) is a stand-alone computer program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM).
Versions: 3.1.3, 4.0.0, 4.0.1
Variants: gpu, cpu
Arches: intel+gpu_delay, intel, amd
Modules: relion/cpu/3.1.3+amd
, relion/gpu/3.1.3+amd
, relion/gpu/3.1.3+intel
, relion/cpu/4.0.0+amd
, relion/gpu/4.0.0+amd
, relion/gpu/4.0.0+intel
, relion/3.1.3
, relion/4.0.0
, relion/cpu/4.0.1+amd
, relion/gpu/4.0.1+amd
, relion/4.0.1
, relion/gpu/4.0.1+intel+gpu_delay
, relion/gpu/4.0.1+intel
A modified version of Relion supporting block-based-reconstruction as described in 10.1038/s41467-018-04051-9.
Versions: 3.1.2
Variants: gpu
Arches: intel
Modules: relion-bbr/gpu/3.1.2+intel
Utilities for Relion Cryo-EM data processing on clusters.
Versions: 0.1, 0.2, 0.3
Arches: generic
Modules: relion-helper/0.1
, relion-helper/0.2
, relion-helper/0.3
RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.
Versions: 4.0.9
Arches: generic
Modules: repeatmasker/4.0.9
RepeatModeler is a de-novo repeat family identification and modeling package.
Versions: 1.0.11
Arches: generic
Modules: repeatmodeler/1.0.11
RepeatScout - De Novo Repeat Finder, Price A.L., Jones N.C. and Pevzner P.A.
Versions: 1.0.5
Arches: generic
Modules: repeatscout/1.0.5
Quality assessment of de novo transcriptome assemblies from RNA-Seq data. rnaQUAST is a tool for evaluating RNA-Seq assemblies using a reference genome and gene database. In addition, rnaQUAST is also capable of estimating gene database coverage by raw reads and de novo quality assessment using third-party software.
Versions: 2.2.0
Arches: generic
Modules: rnaquast/2.2.0
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data.
Versions: 1.3.1
Arches: generic
Modules: rsem/1.3.1
RStudio is an integrated development environment (IDE) for R.
Versions: 2022.12.0-353
Arches: generic
Modules: rstudio-server/2022.12.0-353
Sabre is a tool that will demultiplex barcoded reads into separate files. It will work on both single-end and paired-end data in fastq format. It simply compares the provided barcodes with each read and separates the read into its appropriate barcode file, after stripping the barcode from the read (and also stripping the quality values of the barcode bases). If a read does not have a recognized barcode, then it is put into the unknown file.
Versions: 2013-09-27
Arches: generic
Modules: sabre/2013-09-27
Satsuma2 is an optimised version of Satsuma, a tool to reliably align large and complex DNA sequences providing maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs in vertebrate genomes).
Versions: 2021-03-04
Arches: generic
Modules: satsuma2/2021-03-04
Scallop is a reference-based transcriptome assembler for RNA-seq
Versions: 0.10.5
Arches: generic
Modules: scallop/0.10.5
SeqPrep is a program to merge paired end Illumina reads that are overlapping into a single longer read.
Versions: 1.3.2
Arches: generic
Modules: seqprep/1.3.2
Toolkit for processing sequences in FASTA/Q formats.
Versions: 1.3
Arches: generic
Modules: seqtk/1.3
Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3'-end of reads and also determines when the quality is sufficiently high to trim the 5'-end of reads.
Versions: 1.33
Arches: generic
Modules: sickle/1.33
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Versions: 22-05-6-1
Arches: generic
Modules: slurm/22-05-6-1
SMARTdenovo is a de novo assembler for PacBio and Oxford Nanopore (ONT) data.
Versions: master
Arches: generic
Modules: smartdenovo/master
SortMeRNA is a tool for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data
Versions: 2017-07-13
Arches: generic
Modules: sortmerna/2017-07-13
A flexible read trimming tool for Illumina NGS data.
Versions: 0.39
Arches: generic
Modules: trimmomatic/0.39
a communication library implementing high-performance messaging for MPI/PGAS frameworks
Versions: 1.13.1
Arches: generic
Modules: ucx/1.13.1
In order to access your HPC account, you may need to generate an SSH key pair for authorization. You generate a pair of keys: a public key and a private key. The private key is kept securely on your computer or device. The public key is submitted to HPCCF to grant you access to a cluster.
"},{"location":"general/access/#how-do-i-generate-an-ssh-key-pair","title":"How do I generate an SSH key pair?","text":""},{"location":"general/access/#windows-operating-system","title":"Windows Operating System","text":"We recommend MobaXterm as the most straightforward SSH client. You can download the free Home Edition (Installer Edition) from MobaXterm. Please download the Installer Edition. The Portable Edition deletes the contents of your home directory by default when it exits, which will remove your freshly generate SSH key. Once you install the stable version of MobaXterm, open its terminal and enter this command:
ssh-keygen
This command will create a private key and a public key. Do not share your private key; we recommend giving it a passphrase for security. To view the .ssh directory and to read the public key, enter these commands:
ls -al ~/.ssh\ncat ~/.ssh/*.pub\n
"},{"location":"general/access/#macos-and-linux","title":"macOS and Linux:","text":"Use a terminal to create an SSH key pair using the command:
ssh-keygen
To view the .ssh directory and to read the public key, enter these commands:
ls -al ~/.ssh\ncat ~/.ssh/*.pub\n
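By default, ssh-keygen writes the key pair to an algorithm-specific name such as ~/.ssh/id_rsa. If you would like a dedicated key for the clusters, ssh-keygen also accepts a key type, a comment, and an output filename; this is only a sketch, and the filename id_hpc and the comment are placeholders you can change:
ssh-keygen -t ed25519 -C \"[USER]@ucdavis.edu\" -f ~/.ssh/id_hpc\n
If you choose a non-default filename, see the -i option described later on this page and in the Troubleshooting section for how to point SSH at it.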
"},{"location":"general/access/#x11-forwarding","title":"X11 Forwarding","text":"Some software has a Graphical User Interface (GUI), and so requires X11 to be enabled. X11 forwarding allows an application on a remote server (in this case, Franklin) to render its GUI on a local system (your computer). How this is enabled depends on the operating system the computer you are using to access Franklin is running.
"},{"location":"general/access/#linux","title":"Linux","text":"If you are SSHing from a Linux distribution, you likely already have an X11 server running locally, and can support forwarding natively. If you are on campus, you can use the -Y
flag to enable it, like:
$ ssh -Y [USER]@[CLUSTER].hpc.ucdavis.edu\n
If you are off campus on a slower internet connection, you may get better performance by enabling compression with:
$ ssh -Y -C [USER]@[CLUSTER].hpc.ucdavis.edu\n
If you have multiple SSH key pairs and you want to use a specific private key to connect to the clusters, use the -i
option to specify the path to the private key with SSH: $ ssh -i ~/.ssh/id_hpc [USER]@[CLUSTER].hpc.ucdavis.edu\n
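If you connect with a non-default key regularly, you can avoid typing -i every time by adding an entry to your $HOME/.ssh/config file. A minimal sketch, reusing the id_hpc filename from the example above and the usual placeholders:
Host [CLUSTER]\n    HostName [CLUSTER].hpc.ucdavis.edu\n    User [USER]\n    IdentityFile ~/.ssh/id_hpc\n
With this in place, ssh [CLUSTER] will pick up the right username and key automatically.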
"},{"location":"general/access/#macos","title":"macOS","text":"macOS does not come with an X11 implementation out of the box. You will first need to install the free, open-source XQuartz package, after which you can use the same ssh
flags as described in the Linux instructions.
If you are using our recommended Windows SSH client (MobaXterm), X11 forwarding should be enabled by default. You can confirm this by checking that the X11-Forwarding
box is ticked under your Franklin session settings. For off-campus access, you may want to tick the Compression
box as well.
HPC accounts are provisioned on a per-cluster basis and granted with the permission of their principal investigator. Accounts that are provisioned under each PI will have access to that PI's purchased resources and their own separate home directory.
Access to HPC clusters is granted via the use of SSH keys. An SSH public key is required to generate an account. For information on creating SSH keys, please visit the access documentation page.
"},{"location":"general/account-requests/#hippo","title":"HiPPO","text":"The High-Performance Personnel Onboarding (HiPPO) portal can provision resources for the Farm, Franklin, and Peloton HPC clusters. Users can request an account on HiPPO by logging in with UC Davis CAS and selecting their PI.
Users who do not have a PI and are interested in sponsored tiers for Farm can request an account by selecting the IT director for CAES, Adam Getchell, as their PI.
Users who do not have a PI and who are affiliated with the College of Letters and Science can request a sponsored account on Peloton by selecting the IT director for CLAS, Jeremy Phillips, as their PI.
"},{"location":"general/account-requests/#hpc1-and-hpc2","title":"HPC1 and HPC2","text":"Users who are associated with PI's in the College of Engineering can request accounts on HPC1 and HPC2 by going to the appropriate web form.
"},{"location":"general/account-requests/#lssc0-barbera","title":"LSSC0 (Barbera)","text":"Users who want access to resources on LSSC0 can request an account within the Genome Center Computing Portal and selecting 'Request an Account' with their PI.
"},{"location":"general/account-requests/#atomate","title":"Atomate","text":"Atomate accounts can be requested here.
"},{"location":"general/account-requests/#cardio-demon-impact","title":"Cardio, Demon, Impact","text":"Accounts on these systems can be requested here.
"},{"location":"general/troubleshooting/","title":"Troubleshooting","text":""},{"location":"general/troubleshooting/#common-ssh-issues","title":"Common SSH Issues","text":"Here are some of the most common issues users face when using SSH.
"},{"location":"general/troubleshooting/#keys","title":"Keys","text":"The following clusters use SSH keys: Atomate, Farm, Franklin, HPC1, HPC2, Impact, Peloton.
If you connect to one of these and are asked for a password (as distinct from a passphrase for your key), your key is not being recognized. This is usually because of permissions or an unexpected filename. SSH expects your key to be one of a specific set of names. Unless you have specified something other than the default, this is probably going to be $HOME/.ssh/id_rsa
.
If you specified a different name when generating your key, you can specify it like this:
ssh -i $HOME/.ssh/newkey [USER]@[cluster].hpc.ucdavis.edu\n
If you kept the default value, your permissions should be set so that only you can read and write the key (-rw------- or 600)
. To ensure this is the case, you can do the following:
chown 600 $HOME/.ssh/id_rsa\n
On HPC2, your public key is kept in $HOME/.ssh/authorized_keys
. Please make sure to not remove your key from this file. Doing so will cause you will lose access.
If you are trying to use a key to access LSSC0 or any of the Genome Center login nodes, SSH keys will not work, but there is another method.
To enable logins without a password, you will need to enable GSSAPI, which some systems enable by default. If not enabled, add the following to your $HOME/.ssh/config
file (create it if it doesn't exist):
GSSAPIAuthentication yes\nGSSAPIDelegateCredentials yes\n
The -K
command line switch to ssh does the same thing on a one-time basis.
Once you have GSSAPI
enabled, you can get a Kerberos ticket using
kinit [USER]@GENOMECENTER.UCDAVIS.EDU\n
SSH will use that ticket while it's valid.
"},{"location":"general/troubleshooting/#common-slurm-scheduler-issues","title":"Common Slurm Scheduler Issues","text":"These are the most common issues with job scheduling using Slurm.
"},{"location":"general/troubleshooting/#using-a-non-default-account","title":"Using a non-default account","text":"If you have access to more than one Slurm account and wish to use an account other than your default, use the -A
or --account
flag.
e.g. If your default account is in foogrp
and you wish to use bargrp
:
srun -A bargrp -t 1:00:00 --mem=20GB scriptname.sh\n
"},{"location":"general/troubleshooting/#no-default-account","title":"No default account","text":"Newer slurm accounts have no default specified, and in this case you might get error message like:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified\n
You will need to specify the account explicitly as explained above. You can find out how to view your Slurm account information in the resources section.
"},{"location":"scheduler/","title":"Job Scheduling","text":"HPC clusters run job schedulers to distribute and manage computational resources. Generally, schedulers:
HPCCF clusters use Slurm for job scheduling. A central controller runs on one of the file servers, which users submit jobs to from the access node using the srun
and sbatch
commands. The controller then determines a priority for the job based on the resources requested and schedules it on the queue. Priority calculation can be complex, but the overall goal of the scheduler is to optimize a tradeoff between throughput on the cluster as a whole and turnaround time on jobs.
The Commands section describes how to manage jobs and check cluster status using standard Slurm commands. The Resources section describes how to request computing resources for jobs. The Job Scripts section includes examples of job scripts to be used with sbatch
.
After logging in to a cluster, your session exists on the head node: a single, less powerful computer that serves as the gatekeeper to the rest of the cluster. To do actual work, you will need to write submission scripts that define your job and submit them to the cluster along with resource requests.
"},{"location":"scheduler/commands/#batch-jobs-sbatch","title":"Batch Jobs:sbatch
","text":"Most of the time, you will want to submit jobs in the form of job scripts. The batch job script specifies the resources needed for the job, such as the number of nodes, cores, memory, and walltime. A simple example would be:
jobscript.sh#!/bin/bash \n# (1)\n#SBATCH --ntasks=1\n#SBATCH --cpus-per-task=1\n#SBATCH --time=01:00:00\n#SBATCH --mem=100MB\n#SBATCH --partition=low\n\necho \"Running on $(hostname)\"\n
/bin/sh
or /bin/zsh
.Which can be submitted to the scheduler by running:
$ sbatch jobscript.sh\nSubmitted batch job 629\n
The job script is a normal shell script -- note the #!/bin/bash
-- that contains additional directives. #SBATCH
lines specify directives to be sent to the scheduler; in this case, our resource requests:
--ntasks
: Number of tasks to run. Slurm may schedule tasks on the same or different nodes.--cpus-per-task
: Number of CPUs (cores) to allocate per task.--time
: Maximum wallclock time for the job.--mem
: Maximum amount of memory for the job.--partition
: The queue partition to submit to. See the queueing section for more details.Jobs that exceed their memory or time constraints will be automatically killed. There is no limit on spawning threads, but keep in mind that using far more threads than requested cores will result in rapidly decreasing performance.
#SBATCH
directives directly correspond to arguments passed to the sbatch
command. As such, one could remove the lines starting with #SBATCH
from the previous job script and submit it with:
$ sbatch --ntasks=1 --cpus-per-task=1 --time=01:00:00 --mem=100MB --partition=low jobscript.sh\n
Using directives with job scripts is recommended, as it helps you document your resource requests.
Try man sbatch
or visit the official docs for more options. More information on resource requests can be found in the Resources section, and more examples on writing job scripts can be found in the Job Scripts section.
srun
","text":"Sometimes, you want to run an interactive shell session on a node, such as running an IPython session. srun
takes the same parameters as sbatch
, while also allowing you to specify a shell. For example:
$ srun --ntasks=1 --time=01:00:00 --mem=100MB --partition=low --pty /bin/bash\nsrun: job 630 queued and waiting for resources\nsrun: job 630 has been allocated resources\ncamw@c-8-42:~$\n
Note that addition of the --pty /bin/bash
argument. You can see that the job is queued and then allocated resources, but instead of exiting, you are brought to a new prompt. In the example above, the user camw
has been moved onto the node c-8-42
, which is indicated by the new terminal prompt, camw@c-8-42
. The same resource and time constraints apply in this session as in sbatch
scripts.
This is the only way to get direct access to a node: you will not be able to simply do ssh c-8-42
, for example.
Try man srun
or visit the official docs for more options.
squeue
","text":"squeue
can be used to monitor running and queued jobs. Running it with no arguments will show all the jobs on the cluster; depending on how many users are active, this could be a lot!
$ squeue\n JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n 589 jawdatgrp Refine3D adtaheri R 1-13:51:39 1 gpu-9-18\n 631 low jobscrip camw R 0:19 1 c-8-42\n 627 low Class2D/ mashaduz R 37:11 1 gpu-9-58\n...\n
To view only your jobs, you can use squeue --me
.
$ squeue --me\n JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n 631 low jobscrip camw R 0:02 1 c-8-42\n
The format -- which columns and their width -- can be tuned with the --format
parameter. For example, you might way to also include how many cores the job requested, and widen the fields:
$ squeue --format=\"%10i %.9P %.20j %.10u %.3t %.25S %.15L %.10C %.6D %.20R\"\nJOBID PARTITION NAME USER ST START_TIME TIME_LEFT CPUS NODES NODELIST(REASON)\n589 jawdatgrp Refine3D/job015/ adtaheri R 2023-01-31T22:51:59 9:58:38 6 1 gpu-9-18\n627 low Class2D/job424/ mashaduz R 2023-02-02T12:06:27 11:13:06 60 1 gpu-9-58\n
Try man squeue
or visit the official docs for more options.
scancel
","text":"To kill a job before it has completed, use the scancel command:
$ scancel JOBID # (1)!\n
JOBID
with the ID of your job, which can be obtained with squeue
.You can cancel many jobs at a time; for example, you could cancel all of your running jobs with:
$ scancel -u $USER #(1)!\n
$USER
is an environment variable containing your username, so leave this as is to use it.Try man scancel
or visit the official docs for more options.
scontrol
","text":"scontrol show
can be used to display any information known to Slurm. For users, the most useful are the detailed job and node information.
To display details for a job, run:
$ scontrol show j 635\nJobId=635 JobName=jobscript.sh\n UserId=camw(1134153) GroupId=camw(1134153) MCS_label=N/A\n Priority=6563 Nice=0 Account=admin QOS=adminmed\n JobState=RUNNING Reason=None Dependency=(null)\n Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0\n RunTime=00:00:24 TimeLimit=01:00:00 TimeMin=N/A\n SubmitTime=2023-02-02T13:26:24 EligibleTime=2023-02-02T13:26:24\n AccrueTime=2023-02-02T13:26:24\n StartTime=2023-02-02T13:26:25 EndTime=2023-02-02T14:26:25 Deadline=N/A\n PreemptEligibleTime=2023-02-02T13:26:25 PreemptTime=None\n SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-02-02T13:26:25 Scheduler=Main\n Partition=low AllocNode:Sid=nas-8-0:449140\n ReqNodeList=(null) ExcNodeList=(null)\n NodeList=c-8-42\n BatchHost=c-8-42\n NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*\n TRES=cpu=2,mem=100M,node=1,billing=2\n Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*\n MinCPUsNode=1 MinMemoryNode=100M MinTmpDiskNode=0\n Features=(null) DelayBoot=00:00:00\n OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)\n Command=/home/camw/jobscript.sh\n WorkDir=/home/camw\n StdErr=/home/camw/slurm-635.out\n StdIn=/dev/null\n StdOut=/home/camw/slurm-635.out\n Power=\n
Where 635
should be replaced with the ID for your job. For example, you can see that this job was allocated resources on c-8-42
(NodeList=c-8-42
), that its priority score is 6563 (Priority=6563
), and that the script it ran with is located at /home/camw/jobscript.sh
.
We can also get details on nodes. Let's interrogate c-8-42
:
$ scontrol show n c-8-42\nNodeName=c-8-42 Arch=x86_64 CoresPerSocket=64 \n CPUAlloc=4 CPUEfctv=256 CPUTot=256 CPULoad=0.12\n AvailableFeatures=amd,cpu\n ActiveFeatures=amd,cpu\n Gres=(null)\n NodeAddr=c-8-42 NodeHostName=c-8-42 Version=22.05.6\n OS=Linux 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 \n RealMemory=1000000 AllocMem=200 FreeMem=98124 Sockets=2 Boards=1\n State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A\n Partitions=low,high \n BootTime=2022-12-11T02:25:44 SlurmdStartTime=2022-12-14T10:34:25\n LastBusyTime=2023-02-02T13:13:22\n CfgTRES=cpu=256,mem=1000000M,billing=256\n AllocTRES=cpu=4,mem=200M\n CapWatts=n/a\n CurrentWatts=0 AveWatts=0\n ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s\n
CPUAlloc=4
tells us that 4 cores are currently allocated on the node. AllocMem=200
indicates that 200MiB of RAM are currently allocated, with RealMemory=1000000
telling us that there is 1TiB of RAM total on the node.
sinfo
","text":"Another useful status command is sinfo
, which is specialized for displaying information on nodes and partitions. Running it without any arguments gives information on partitions:
$ sinfo\nPARTITION AVAIL TIMELIMIT NODES STATE NODELIST\nlow* up 12:00:00 3 mix gpu-9-[10,18,58]\nlow* up 12:00:00 8 idle c-8-[42,50,58,62,70,74],gpu-9-[26,66]\nhigh up 60-00:00:0 6 idle c-8-[42,50,58,62,70,74]\njawdatgrp-gpu up infinite 2 mix gpu-9-[10,18]\njawdatgrp-gpu up infinite 1 idle gpu-9-26\n
In this case, we can see that there are 3 partially-allocated nodes in the low
partition (they have state mix
), and that the time limit for jobs on the low
partition is 12 hours.
Passing the -N
flag tells sinfo
to display node-centric information:
$ sinfo -N\nNODELIST NODES PARTITION STATE \nc-8-42 1 low* idle \nc-8-42 1 high idle \nc-8-50 1 low* idle \nc-8-50 1 high idle \nc-8-58 1 low* idle \nc-8-58 1 high idle \nc-8-62 1 low* idle \nc-8-62 1 high idle \nc-8-70 1 low* idle \nc-8-70 1 high idle \nc-8-74 1 low* idle \nc-8-74 1 high idle \ngpu-9-10 1 low* mix \ngpu-9-10 1 jawdatgrp-gpu mix \ngpu-9-18 1 low* mix \ngpu-9-18 1 jawdatgrp-gpu mix \ngpu-9-26 1 low* idle \ngpu-9-26 1 jawdatgrp-gpu idle \ngpu-9-58 1 low* mix \ngpu-9-66 1 low* idle\n
There is an entry for each node in each of its partitions. c-8-42
is in both the low
and high
partitions, while gpu-9-10
is in the low
and jawdatgrp-gpu
partitions.
More verbose information can be obtained by also passing the -l
or --long
flag:
$ sinfo -N -l\nThu Feb 02 14:04:48 2023\nNODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON \nc-8-42 1 low* idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-42 1 high idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-50 1 low* idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-50 1 high idle 256 2:64:2 100000 0 1 amd,cpu none \nc-8-58 1 low* idle 256 2:64:2 100000 0 1 amd,cpu none\n...\n
This view gives the nodes' socket, core, and thread configurations, their RAM, and the feature list, which you can read about in the Resources section. Try man scontrol
or man sinfo
, or visit the official docs for scontrol
and sinfo
, for more options.
Each node -- physically distinct machines within the cluster -- will be a member of one or more partitions. A partition consists of a collection of nodes, a policy for job scheduling on that partition, a policy for conflicts when nodes are a member of more than one partition (ie. preemption), and a policy for managing and restricting resources per user or per group referred to as Quality of Service. The Slurm documentation has detailed information on how preemption and QOS definitions are handled; our per-cluster Resources sections describe how partitions are organized and preemption handled on our clusters.
"},{"location":"scheduler/resources/#accounts","title":"Accounts","text":"Users are granted access to resources via Slurm associations. An association links together a user with an account and a QOS definition. Accounts most commonly correspond to your lab, but sometimes exist for graduate groups, departments, or institutes.
To see your associations, and thus which accounts and partitions you have access to, you can use the sacctmgr
command:
$ sacctmgr show assoc user=$USER\n Cluster Account User Partition Share ... MaxTRESMins QOS Def QOS GrpTRESRunMin \n---------- ---------- ---------- ---------- --------- ... ------------- -------------------- --------- ------------- \n franklin hpccfgrp camw mmgdept-g+ 1 ... hpccfgrp-mmgdept-gp+ \n franklin hpccfgrp camw mmaldogrp+ 1 ... hpccfgrp-mmaldogrp-+ \n franklin hpccfgrp camw cashjngrp+ 1 ... hpccfgrp-cashjngrp-+ \n franklin hpccfgrp camw jalettsgr+ 1 ... hpccfgrp-jalettsgrp+ \n franklin hpccfgrp camw jawdatgrp+ 1 ... hpccfgrp-jawdatgrp-+ \n franklin hpccfgrp camw low 1 ... hpccfgrp-low-qos \n franklin hpccfgrp camw high 1 ... hpccfgrp-high-qos \n franklin jawdatgrp camw low 1 ... mcbdept-low-qos \n franklin jawdatgrp camw jawdatgrp+ 1 ... jawdatgrp-jawdatgrp+ \n franklin jalettsgrp camw jalettsgr+ 1 ... jalettsgrp-jalettsg+ \n franklin jalettsgrp camw low 1 ... mcbdept-low-qos \n
The output is very wide, so you may want to pipe it through less
to make it more readable:
sacctmgr show assoc user=$USER | less -S\n
Or, perhaps preferably, output it in a more compact format:
$ sacctmgr show assoc user=$USER format=\"account%20,partition%20,qos%40\"\n Account Partition QOS \n-------------------- -------------------- ---------------------------------------- \n hpccfgrp mmgdept-gpu hpccfgrp-mmgdept-gpu-qos \n hpccfgrp mmaldogrp-gpu hpccfgrp-mmaldogrp-gpu-qos \n hpccfgrp cashjngrp-gpu hpccfgrp-cashjngrp-gpu-qos \n hpccfgrp jalettsgrp-gpu hpccfgrp-jalettsgrp-gpu-qos \n hpccfgrp jawdatgrp-gpu hpccfgrp-jawdatgrp-gpu-qos \n hpccfgrp low hpccfgrp-low-qos \n hpccfgrp high hpccfgrp-high-qos \n jawdatgrp low mcbdept-low-qos \n jawdatgrp jawdatgrp-gpu jawdatgrp-jawdatgrp-gpu-qos \n jalettsgrp jalettsgrp-gpu jalettsgrp-jalettsgrp-gpu-qos \n jalettsgrp low mcbdept-low-qos \n
In the above example, we can see that user camw
has access to the high
partition via an association with hpccfgrp
and the jalettsgrp-gpu
partition via the jalettsgrp
account.
CPUs are the central compute power behind your jobs. Most scientific software supports multiprocessing (multiple instances of an executable with discrete memory resources, possibly but not necessarily communicating with each other), multithreading (multiple paths, or threads, of execution within a process on a node, sharing the same memory resources, but able to execute on different cores), or both. This allows computation to scale with increased numbers of CPUs, allowing bigger datasets to be analyzed.
Slurm's CPU management methods are complex and can quickly become confusing. For the purposes of this documentation, we will provide a simplified explanation; those with advanced needs should consult the Slurm documentation.
Slurm follows a distinction between its physical resources -- cluster nodes and CPUs or cores on a node -- and virtual resources, or tasks, which specify how requested physical resources will be grouped and distributed. By default, Slurm will minimize the number of nodes allocated to a job, and attempt to keep the job's CPU requests localized within a node. Tasks group together CPUs (or other resources): CPUs within a task will be kept together on the same node. Different tasks may end up on different nodes, but Slurm will exhaust the CPUs on a given node before splitting tasks between nodes unless specifically requested.
A Complication: SMT / Hyperthreading
Slurm understands the distinction between physical and logical cores. Most modern CPUs support Simultaneous Multithreading (SMT), which allows multiple independent processes to run on a single physical core. Although each of these is not a full fledged core, they have independent hardware for certain operations, and can greatly improve scalability for some tasks. However, using an individual thread within a single core makes little sense, as it shares hardware with the other SMT threads on its core; so, Slurm will always keep these threads together. In practice, this means if you ask for an odd number of CPUs, your request will be rounded up so as not to split an SMT thread between different job allocations.
The primary parameters controlling these are:
--cpus-per-task/-c
: How many CPUs to request per task. The number of CPUs requested here will always be on the same node. By default, 1.--ntasks/-n
: The number of tasks to request. By default, 1.--nodes/-N
: The minimum number of nodes to request, by default, 1.Let's explore some examples. The simple request would be to ask for 2 CPUs. We will use srun
to request resources and then immediately run the nproc
command within the allocation to report how many CPUs are available:
$ srun -c 2 nproc \nsrun: job 682 queued and waiting for resources\nsrun: job 682 has been allocated resources\n2\n
We asked for 2 CPUs per task, and Slurm gave us 2 CPUs and 1 task. What happens if we ask for 2 tasks instead of 2 CPUs?
$ srun -n 2 nproc\nsrun: job 683 queued and waiting for resources\nsrun: job 683 has been allocated resources\n1\n1\n
This time, we were given 2 separate tasks, each of which got 1 CPU. Each task ran its own instance of the nproc
command, and so each reported 1
. If we ask for more CPUs per task:
$ srun -n 2 -c 2 nproc\nsrun: job 684 queued and waiting for resources\nsrun: job 684 has been allocated resources\n2\n2\n
We still asked for 2 tasks, but this time we requested 2 CPUs in each. So, we got 2 instances of nproc
, each reported 2
CPUs in their task.
Summary
If you want to run multithreaded jobs, use --cpus-per-task N_THREADS
and -ntasks 1
. If you want a multiprocess job (or an MPI job), increase -ntasks
.
If we use -c 1
without specifying the number of tasks, we might be taken by surprise:
$ srun -c 1 nproc \nsrun: job 685 queued and waiting for resources\nsrun: job 685 has been allocated resources\n1\n1\n
We only asked for 1 CPU per task, but we got 2 tasks! This is due to SMT, described in the note above. Because Slurm will not split SMT threads, and there are 2 SMT threads per physical core, the request was rounded up to 2 CPUs total. In order to keep with the 1 CPU-per-task constraint, it spawned 2 tasks. Similarly, if we specify that we only want 1 task, CPUs per task will instead be bumped:
$ srun -c 1 -n 1 nproc\nsrun: job 686 queued and waiting for resources\nsrun: job 686 has been allocated resources\n2\n
"},{"location":"scheduler/resources/#nodes","title":"Nodes","text":"Let's explore multiple nodes a bit further. We have seen previously that the -n/ntasks
parameter will allocate discrete groups of cores. In our prior examples, however, we used small resource requests. What happens when we want to distribute jobs across nodes?
Slurm uses the block distribution method by default to distribute tasks between nodes. It will exhaust all the CPUs on a node with task groups before moving to a new node. For these examples, we're going to create a script that reports both the hostname (ie, the node) and the number of CPUs:
host-nprocs.sh#!/bin/bash\n\necho `hostname`: `nproc`\n
And make it executable with chmod +x host-nprocs.sh
.
Now let's make a multiple-task request:
$ srun -c 2 -n 2 ./host-nprocs.sh\nsrun: job 691 queued and waiting for resources\nsrun: job 691 has been allocated resources\nc-8-42: 2\nc-8-42: 2\n
As before, we asked for 2 tasks and 2 CPUs per task. Both tasks were assigned to c-8-42
, because it had enough CPUs to fulfill the request. What if it did not?
$ srun -c 120 -n 3 ./host-nprocs.sh\nsrun: job 692 queued and waiting for resources\nsrun: job 692 has been allocated resources\nc-8-42: 120\nc-8-42: 120\nc-8-50: 120\n
This time, we asked for 3 tasks each with 120 CPUs. The first two tasks were able to be fulfilled by the node c-8-42
, but that node did not have enough CPUs to allocate another 120 on top of that. So, the third task was distributed to c-8-50
. Thus, this task spanned multiple nodes.
Sometimes, we want to make sure each task has its own node. We can achieve this with the --nodes/-N
parameter. This specifies the minimum number of nodes the tasks should be allocated across. If we rerun the above example:
$ srun -c 120 -n 3 -N 3 ./host-nprocs.sh\nsrun: job 693 queued and waiting for resources\nsrun: job 693 has been allocated resources\nc-8-42: 120\nc-8-50: 120\nc-8-58: 120\n
We still asked for 3 tasks with 120 CPUs per task, but this time we specified we wanted a minimum of 3 nodes. As a result, we were allocated portions of c-8-42
, c-8-50
, and c-8-58
.
Random Access Memory (RAM) is the fast, volatile storage that your programs use to store data during execution. This can be contrasted with disk storage, which is non-volatile and many orders of magnitude slower to access, and is used for long term data -- say, your sequencing runs or cryo-EM images. RAM is a limited resource on each node, so Slurm enforces memory limits for jobs using cgroups. If a job step consumes more RAM than requested, the step will be killed.
Some (mutually exclusive) parameters for requesting RAM are:
--mem
: The memory required per-node. Usually, you want to use --mem-per-cpu
.--mem-per-cpu
: The memory required per CPU or core. If you requested \\(N\\) tasks, \\(C\\) CPUs per task, and \\(M\\) memory per CPU, your total memory usage will be \\(N * C * M\\). Note that, if \\(N \\gt 1\\), you will have \\(N\\) discrete \\(C * M\\)-sized chunks of RAM requested, possibly across different nodes.--mem-per-gpu
: Memory required per GPU, which will scale with GPUs in the same way as --mem-per-cpu
will with CPUs.For all memory requests, units can be specified explicitly with the suffixes [K|M|G|T]
for [kilobytes|megabytes|gigabytes|terabytes]
, with the default units being M
/megabytes
. So, --mem-per-cpu=500
will requested 500 megabytes of RAM per CPU, and --mem-per-cpu=32G
will request 32 gigabytes of RAM per CPU.
Here is an example of a task overrunning its memory allocation. We will use the stress-ng
program to allocate 8 gigabytes of RAM in a job that only requested 200 megabytes.
$ srun -n 1 --cpus-per-task 2 --mem-per-cpu 200M stress-ng -m 1 --vm-bytes 8G --oomable 1 \u21b5\nsrun: job 706 queued and waiting for resources\nsrun: job 706 has been allocated resources\nstress-ng: info: [3037475] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor\nstress-ng: info: [3037475] dispatching hogs: 1 vm\nstress-ng: info: [3037475] successful run completed in 2.23s\nslurmstepd: error: Detected 1 oom-kill event(s) in StepId=706.0. Some of your processes may have been killed by the cgroup out-of-memory handler.\nsrun: error: c-8-42: task 0: Out Of Memory\nsrun: launch/slurm: _step_signal: Terminating StepId=706.0\n
"},{"location":"scheduler/resources/#gpus-gres","title":"GPUs / GRES","text":""},{"location":"software/","title":"Software","text":"We provide a broad range of software and scientific libraries for our users. On our primary clusters, we deploy software with spack and conda to foster reproducibility. Packages installed via spack have automatically generated modulefiles available for each package, while our conda-deployed software is deployed in individual, centrally installed environments which can be loaded via module or activated traditionally.
To request new software, please read the information in our Software Installation Policy, and submit a request through our Software Request Form.
Because spack is our primary installation method, we are far more likely to approve a new piece of software if it is already available via a spack repo. You can check whether a spack package for your software already exists here.
"},{"location":"software/conda/","title":"Python and Conda","text":""},{"location":"software/modules/","title":"Module System","text":""},{"location":"software/modules/#intro","title":"Intro","text":"High performance compute clusters usually have a variety of software with sometimes conflicting dependencies. Software packages may need to make modifications to the user environment, or the same software may be compiled multiple times to run efficiently on differing hardware within the cluster. To support these use cases, software is managed with a module system that prepares the user environment to access specific software on load and returns the environment to its former state when unloaded. A module is the bit of code that enacts and tracks these changes to the user environment, and the module system is software that runs these modules and the collection of modules it is aware of. Most often, a module is associated with a specific software package at a specific version, but they can also be used to make more general changes to a user environment; for example, a module could load a set of configurations for the BASH shell that set color themes.
The two most commonly deployed module systems are environment modules (or envmod
) and lmod. All HPCCF clusters currently use envmod
.
The module
command is the entry point for users to manage modules in their environment. All module operations will be of the form module [SUBCOMMAND]
. Usage information is available on the cluster by running module --help
.
The basic commands are: module load [MODULENAME]
to load a module into your environment; module unload [MODULENAME]
to remove that module; module avail
to see modules available for loading; and module list
to see which modules are currently loaded. We will go over these commands, and some additional commands, in the following sections.
module avail
","text":"Lists the modules currently available to load on the system. The following is some example output from the Franklin cluster:
$ module avail\n--------------------- /share/apps/22.04/modulefiles/spack/core ----------------------\naocc/4.1.0 intel-oneapi-compilers/2023.2.1 pmix/4.2.6+amd \ncuda/8.0.61 libevent/2.1.12 pmix/4.2.6+intel \ncuda/11.2.2 libevent/2.1.12+amd pmix/default \ncuda/11.7.1 libevent/2.1.12+intel slurm/23-02-6-1 \ncuda/default libevent/default slurm/23-02-7-1 \nenvironment-modules/5.2.0 openmpi/4.1.5 ucx/1.14.1 \ngcc/7.5.0 openmpi/4.1.5+amd ucx/1.14.1+amd \ngcc/11.4.0 openmpi/4.1.5+intel ucx/1.14.1+intel \ngcc/13.2.0 openmpi/default ucx/default \nhwloc/2.9.1 pmix/4.2.6 \n\n------------------- /share/apps/22.04/modulefiles/spack/software --------------------\nabyss/2.3.5 igv/2.12.3 pplacer/1.1.alpha19 \nalphafold/2.3.2 infernal/1.1.4 prodigal/2.6.3 \namdfftw/4.1+amd intel-oneapi-mkl/2023.2.0+intel prokka/1.14.6 \namdfftw/default intel-oneapi-mkl/default py-checkm-genome/1.2.1 \nangsd/0.935 intel-oneapi-tbb/2021.10.0+amd py-cutadapt/4.4 \naragorn/1.2.41 intel-oneapi-tbb/2021.10.0+intel py-deeptools/3.5.2 \naria2/1.36.0 intel-oneapi-tbb/default py-htseq/2.0.3\n...\n
Each entry corresponds to software available for load. Modules that are currently loaded will be highlighted.
"},{"location":"software/modules/#module-list","title":"module list
","text":"Lists the modules currently loaded in the user environment. By default, the output should be similar to:
$ module list\nCurrently Loaded Modulefiles:\n 1) slurm/23-02-7-1 2) ucx/1.14.1 3) openmpi/default \n
Additional modules will be added or removed as you load and unload them.
"},{"location":"software/modules/#loading-and-unloading","title":"Loading and Unloading","text":""},{"location":"software/modules/#module-load","title":"module load
","text":"This loads the requested module into the active environment. Loading a module can edit environment variables, such as prepending directories to $PATH
so that the executables within can be run, set and unset new or existing environment variables, define shell functions, and generally, modify your user environment arbitrarily. The modifications it makes are tracked, so that when the module is eventually unloaded, any changes can be returned to their former state.
Let's load a module.
$ module load bwa/0.7.17\nbwa/0.7.17: loaded.\n
Now, you have access to the bwa
executable. If you try to run bwa mem
, you'll get its help output. This also sets the appropriate variables so that you can now run man bwa
to view its manpage.
Note that some modules have multiple versions. Running module load [MODULENAME]
without specifying a version will load the latest version, unless a default has been specified.
Some modules are nested under a deeper hierarchy. For example, relion
on Franklin has many variants, under both relion/cpu
and relion/gpu
. To load these, you must specify the second layer of the hierarchy: module load relion
will fail, but module load relion/cpu
will load the default module under relion/cpu
, which has the full name relion/cpu/4.0.0+amd
. More information on this system can be found under Organization.
The modules are all configured to set a $NAME_ROOT
variable that points to the installation prefix. This will correspond to the name of the module, minus the version. For example:
$ echo $BWA_ROOT\n/share/apps/22.04/spack/opt/software/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/bwa-0.7.17-y22jt6d7qm63i2tohmu7gqeedxytadky\n
Usually, this will be a very long pathname, as most software on the cluster is managed via the spack build system. This would be most useful if you're developing software on the cluster.
"},{"location":"software/modules/#module-unload","title":"module unload
","text":"As one might expect, module unload
removes a loaded module from your environment. Any changes made by the module are undone and your environment is restored to its state prior to loading the module.
module whatis
","text":"This command prints a description of the module, if such a description is available. For example:
$ module whatis gsl\n-------------------------------------------- /share/apps/22.04/modulefiles/spack/software --------------------------------------------\n gsl/2.7.1: The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License. The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.\n
"},{"location":"software/modules/#module-show","title":"module show
","text":"module show
will list all the changes a module would make to the environment.
$ module show gcc/11.4.0\n-------------------------------------------------------------------\n/share/apps/22.04/modulefiles/spack/core/gcc/11.4.0:\n\nmodule-whatis {The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Ada, and Go, as well as libraries for these languages.}\nconflict gcc\nprepend-path --delim : LD_LIBRARY_PATH /share/apps/22.04/spack/opt/core/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-11.4.0-evrz2iaatpna4lvzwh5sjujgfrlqprx5/lib64\n...\nsetenv CC /share/apps/22.04/spack/opt/core/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-11.4.0-evrz2iaatpna4lvzwh5sjujgfrlqprx5/bin/gcc\nsetenv CXX /share/apps/22.04/spack/opt/core/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-11.4.0-evrz2iaatpna4lvzwh5sjujgfrlqprx5/bin/g++\n...\n
This is particularly useful for developing with libraries where you might be interested in variables relevant to your build system.
"},{"location":"software/modules/#module-search","title":"module search
","text":"The module search
command allows you to search the names and whatis information for every module. The result will be a list of matching modules and the highlighted matching search terms.
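For example, to list every module whose name or whatis description mentions alignment (the search term here is arbitrary):
module search alignment\n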