
RUN

1- Configure the initial conditions. A configuration template is provided in "ifile/earth.thr". Copy that file wherever you'd like as an initial condition file, e.g. "init/myplanet.thr".

2- Set the planet's and model's parameters in "init/myplanet.thr".

3- Run

   $ ./bin/esp init/myplanet.thr

4- Press enter and go grab a coffee. Or lunch.

Command line arguments

  • Positional argument: config filename (e.g. init/myplanet.thr)

Keyword arguments:

 -g / --gpu_id <N>             ID of the GPU to run on
 -o / --output_dir <PATH>      directory to write results to
 -i / --initial <PATH>         initial conditions HDF5 filename
 -N / --numsteps <N>           number of steps to run
 -w / --overwrite              force overwrite of output files if they exist
 -c / --continue <PATH>        continue simulation from this output file
 -b / --batch                  run in batch mode

Keyword arguments supersede config file arguments. If an initial conditions path is given on the command line, the simulation starts from there instead of from rest, ignoring the 'rest' setting in the config file.
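
For example, the following overrides the output directory, number of steps, and GPU from the command line (the paths and step count are placeholders, not values from the THOR repository):

$ ./bin/esp init/myplanet.thr -o results/myplanet -N 10000 -g 0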

  • -g / --gpu_id Runs on the GPU with the given ID.

  • -o / --output_dir Writes results to this directory. It also scans the output directory for already existing files and runs, continues, or restarts depending on the options given.

  • -N / --numsteps Number of simulation steps to run.

  • -i / --initial Instead of starting from rest, use this file as the initial conditions, together with the provided model parameters. Checks the consistency of the model parameters with the planet and grid definitions used in the initial file and starts the step count from 0.

  • -w / --overwrite If the output directory already contains result files, the simulation does not run and prints a warning. This flag forces the simulation to overwrite the existing files.

  • -c / --continue Continues a simulation from an output file. Like --initial, but continues at the simulation step and time stored in the input file. This makes it possible to restart the simulation from a specific step (to continue a run, debug, or rerun with some changed parameters); see the examples after this list.

  • -b / --batch Run in batch mode in the output directory. It checks the output directory for result files:

    • if none exist, start a simulation from scratch.
    • if some exist, look for the last valid written file and continue from that point. This is useful for running simulations on a cluster with a time limit: when the application receives the INT or TERM signal, it writes the last simulation step to disk. Launching the simulation from a batch script with -b in the same folder starts the simulation or continues from the last save point.
    • the batch feature can be used to restart a simulation from an arbitrary point as well, but this should be done with caution: to accomplish this, manually edit the esp_write_log_*.txt file, removing all lines after the desired restart point. Then, when rerunning THOR with the -b flag, the simulation will begin from that point rather than from the last save file. Note that this will blindly overwrite the duplicate output files, so be sure to make a copy of them (or, preferably, the entire directory) if the previous output is needed.
  • Exclusive options: --initial and --continue are mutually exclusive, and --batch and --continue are mutually exclusive.
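
As an illustration (the output directory and HDF5 file name below are placeholders), a continued run and a batch run could look like:

# continue at the step and time stored in a specific output file
$ ./bin/esp init/myplanet.thr -c results/myplanet/esp_output_planet_10.h5

# batch mode: start from scratch or continue from the last valid file in the output directory
$ ./bin/esp init/myplanet.thr -b -o results/myplanet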

Running benchmark tests

The benchmark tests disable the additional physics modules and enable forcing on the simulation. Set the core_benchmark key in the config file (see the example after this list). Available values are:

  1. HeldSuarez: Held-Suarez test for earth
  2. ShallowHotJupiter: Benchmark test for shallow hot Jupiter
  3. DeepHotJupiter: Benchmark test for deep hot Jupiter
  4. TidallyLockedEarth: Benchmark test for tidally locked Earth
  5. NoBenchmark: No benchmark test - enables external physics module
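
For example, to run the Held-Suarez test, the config file would contain a line like the following (assuming the key = value syntax used in ifile/earth.thr):

core_benchmark = HeldSuarez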

SLURM Batch script

Simple Batch script

Simple batch script launching THOR through SLURM, from /home/thoruser/THOR, with job name earth, configuration file ifile/earth.thr, with 1 task and 1 GPU, with a time limit of 2 hours, saved in the file esp.job:

#!/bin/bash
#SBATCH -D /home/thoruser/THOR/
#SBATCH -J earth
#SBATCH -n 1 --gres gpu:1
#SBATCH --time 0-2
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --output="/home/thoruser/slurm-esp-%j.out"

srun bin/esp ifile/earth.thr

Launch it in the job queue as

$ sbatch esp.job

Multiple batches with timeout

On a SLURM queue with a time limit on jobs, you can run into the issue that the queue kills your job after some time and you need to restart it to continue. For this, you need to queue several consecutive jobs, each depending on the previous one.

Base script simple_slurm.sh, like the simple script above, but starting THOR in batch mode and sending an INT signal 60 seconds before the time limit:

#!/bin/bash

#SBATCH -D /home/thoruser/THOR/
#SBATCH -J earth
#SBATCH -n 1 --gres gpu:1
#SBATCH --time 0-2
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --output="/home/thoruser/slurm-esp-%j.out"
#SBATCH --signal=INT@60

srun bin/esp ifile/earth.thr -b

Batching script slurm_batch.sh, queuing a chain of dependent jobs so that each one restarts from the previous one's save point:

#!/bin/bash

# submit the first job and strip everything but the numeric job ID from sbatch's output
IDS=$(sbatch simple_slurm.sh)
ID=${IDS//[!0-9]/}
echo "ID $ID"

# queue 7 more jobs, each one starting only after the previous job has ended
for t in {1..7..1}
do
  IDS=$(sbatch --dependency=afterany:$ID simple_slurm.sh)
  ID=${IDS//[!0-9]/}
  echo "ID $ID"
done
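
Running the script submits the initial job plus 7 dependent jobs (8 in total), each waiting for the previous one to end:

$ bash slurm_batch.sh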

See slurm_batch_run.py for a Python script that does the same thing in a single script.

Using slurm_batch_run.py for easier job submission

The python script THOR/tools/slurm_batch_run.py can be executed on the command line to create and submit jobs to a slurm scheduler. The options are sometimes specific to the Ubelix cluster at UniBe (but can be customized for different systems, depending on your needs).

Basic execution looks like:

$ python3 tools/slurm_batch_run.py <config file> -n <number of dependent jobs> -jn <descriptive identifier> <other options>

Before running it, create a file called slurm.cfg in your THOR directory that contains the following settings:

[DEFAULTS]
# working directory where slurm is run from
working_dir = <path to thor>
# email address to send slurm report to
user_email = <your email>
# where to log data in
log_dir = <path to a directory for slurm logs (you must create one first or jobs will mysteriously vanish!)>
# slurm resource request
gpu_key = <name of gpu type, for example: gpu:gtx1080ti:1>
# slurm partition
partition = <name of default partition to submit to, for example: gpu-invest>

The file above sets the defaults used by slurm_batch_run.py.
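
A filled-in example, using the placeholder paths and the resource names that appear elsewhere on this page (adapt them to your own system):

[DEFAULTS]
working_dir = /home/thoruser/THOR/
user_email = [email protected]
log_dir = /home/thoruser/slurm_logs/
gpu_key = gpu:gtx1080ti:1
partition = gpu-invest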

Options for slurm_batch_run.py:

 -n / --num_jobs <N>             number of dependent jobs (each dependent on the previous)
 -jn / --job_name <string>       descriptive name of job (for your own reference) 
 -o / --output <string>          specify output directory for THOR
 -d / --dependency <N>           run after job id given here
 -par / --partition <string>     name of partition to submit to
 -qos / --qos_preempt            add quality of service argument `job_gpu_preempt` (Ubelix specific, only with `gpu` partition)
 -g / --gpu_type                 gpu identifier (on Ubelix, these are `gtx1080ti`, `rtx2080ti`, `rtx3090`, and `teslaP100`)
 -p / --prof                     run profiler on job (advanced)
 -r / --report                   run reporting code at end of sim (advanced)
 --pp                            run post-processing (advanced)
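
For instance, a submission of 4 chained jobs on a gtx1080ti could look like this (the config file, job name, and output directory are placeholders):

$ python3 tools/slurm_batch_run.py init/myplanet.thr -n 4 -jn myplanet_test -o results/myplanet_test -g gtx1080ti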

Some specifics about the job queues on Ubelix: when submitting to the gpu partition, you have a limited number of jobs you can submit for EACH TYPE of GPU. See the Ubelix documentation for specifics: https://hpc-unibe-ch.github.io/ (under SLURM / Job handling -> Partitions / QoS). You can submit additional jobs beyond the limit by adding the -qos flag when running slurm_batch_run.py, though these jobs may be preempted (booted off the node) by users in investor queues. If you have access to the investor queue, you can run additional jobs with the option -par gpu-invest. Jobs submitted on the investor queue are subject to a group quota (I haven't figured out what this is yet). You can see details about which queues you have access to with the command:

$ sacctmgr show assoc where user=$USER format=user%20,account%20,partition%16,qos%40,defaultqos%20

Results

  • Output is written to the "results" folder, or to the path configured in the config file or given on the command line.
  • Useful command-line tools for quickly exploring the HDF5 files are described at support.hdfgroup.org/HDFS/Tutor/cmdtools.html, or type "man h5dump" (see the example after this list).
  • You can find some Matlab and Python routines to explore the results in the "mjolnir" folder.
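
For example, to inspect one of the output files with h5dump (the file name below is a placeholder):

# list the objects (groups and datasets) contained in the file
$ h5dump -n results/esp_output_planet_10.h5

# print dataset headers (names and dimensions) without the data
$ h5dump -H results/esp_output_planet_10.h5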