# Merlin workflow example with heat conduction problem
The overall ROM workflow involves several stages: offline FOM sample generation, POD/DMD training, projection-ROM operator building, and online ROM prediction. In general these stages cannot be executed automatically with one click:
- some manual pre/post-processing is required at each stage;
- even with shell scripts that automate such pre/post-processing, large-scale simulations must be submitted to a job-queueing system (such as slurm/moab), and one must wait for the queue and for the actual job to complete.
merlin makes the workflow seamless and automatic, even when combined with an HPC job submission system. The main advantages of using merlin are:
- The entire workflow can be overviewed/orchestrated with a single `yaml` file (see the sketch after this list).
- HPC job submission can be automatic: jobs are submitted, queued, and executed as soon as their prerequisite jobs are complete.
- Each execution case is containerized: an execution of the workflow stores all the command scripts, results, and error messages in a separate directory. This helps with debugging the workflow, and it prevents unintended data corruption from previous runs.
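To give a feel for what such a file contains, here is a minimal, hypothetical sketch of a merlin spec. The step names mirror those used later in this demo, but the commands are placeholders rather than the actual contents of `heat_conduction_hdf.yaml`:

```yaml
description:
  name: heat_conduction_hdf
  description: Sketch of a DMD workflow for the heat conduction example

batch:
  type: local

study:
  - name: prepare_dir
    description: Set up the output directories (placeholder command)
    run:
      cmd: mkdir -p $(SPECROOT)/dmd_data

  - name: sample_foms
    description: Run the FOM once per sampled parameter value (placeholder)
    run:
      cmd: echo "run FOM with parameter $(PARAM)"
      depends: [prepare_dir]

  - name: test_fom
    description: Train DMD and predict at the test parameter (placeholder)
    run:
      cmd: echo "train DMD and predict"
      depends: [sample_foms_*]   # funnel step; waits for every sampled run

merlin:
  samples:
    file: $(SPECROOT)/heat_conduction_hdf_samples.csv
    column_labels: [PARAM]
```

The `depends` entries are what let merlin launch each step automatically as soon as its prerequisites finish; `sample_foms_*` makes `test_fom` wait for all sampled FOM runs to complete.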
The installation procedure for merlin is well documented here.
merlin can be used locally without any job submission, in which case no server needs to be configured. To support parallel job submission, a server must be configured for merlin. Detailed procedures are documented in the merlin documentation.
For LC machines, dedicated IT servers can be created and configured. For detailed instructions, see the LLNL LC confluence page.
This demo executes the workflow equivalent to `examples/dmd/heat_conduction_hdf.sh`. For the detailed setup of the workflow, see `examples/merlin/heat_conduction_hdf.yaml`.
Assuming libROM is built at `$LIBROM_DIR`, we move to `$LIBROM_DIR/examples/merlin`. We should see two files there: `heat_conduction_hdf.yaml` is the merlin config file that orchestrates the entire workflow, and `heat_conduction_hdf_samples.csv` provides the sample parameter values that will be run in the workflow.
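The samples file has one row per FOM run, with columns matching the `column_labels` declared in the spec. The values below are purely illustrative, not the actual contents of `heat_conduction_hdf_samples.csv`:

```
0.01
0.05
0.10
0.50
```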
If not using a server, we can run merlin locally. We first set the batch type in `heat_conduction_hdf.yaml`:

```yaml
batch:
  type: local
```
Simply running the following command will start the workflow:

```
merlin run --local heat_conduction_hdf.yaml
```
If the workflow is initiated, whether it succeeds or not, a new directory `heat_conduction_hdf_cases` is created. The results of the workflow we just executed are all stored in a subdirectory tagged with a time stamp, `heat_conduction_hdf_cases/heat_conduction_hdf_20240502-142042`.
This result directory has the following structure:
```
heat_conduction_hdf_cases
|- heat_conduction_hdf_$(time_label)
   |- dmd_data
   |- dmd_list
   |- merlin_info
   |- prepare_dir
   |- sample_foms
   |- test_fom
```
Each directory corresponds to:
- `dmd_data`: snapshots for the training/test parameter values
- `dmd_list`: list files for the parameter values
- `merlin_info`: detailed merlin info that corresponds to this run case
- `prepare_dir`: command script/output/error of the step `prepare_dir`
- `sample_foms`: command script/output/error of the step `sample_foms`
- `test_fom`: command script/output/error of the step `test_fom`
For the step `test_fom`, the results are stored as follows:
- `MERLIN_FINISHED` is an empty text file that indicates a successful run of the step `test_fom`.
- `test_fom.sh` is the actual command-line script that was executed for the step.
- `test_fom.out` is the output of executing `test_fom.sh`.
- `test_fom.err` is the error message from executing `test_fom.sh`, if it failed. If successful, this file is an empty text file.
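Because every successful step leaves a `MERLIN_FINISHED` marker and failures leave non-empty `.err` files, a finished case can be sanity-checked with standard shell tools. A minimal sketch (adjust the time stamp to your run):

```
cd heat_conduction_hdf_cases/heat_conduction_hdf_20240502-142042

# count the step directories that ran to completion
find . -name MERLIN_FINISHED | wc -l

# list any non-empty error logs left by failed steps
find . -name "*.err" -size +0c
```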
If running in a distributed way, the batch type in `heat_conduction_hdf.yaml` should be set to `flux`. The bank and resources should also be specified:

```yaml
batch:
  type: flux
  bank: asccasc
  queue: pdebug
  shell: /bin/bash
  nodes: 1
```
We first run the configuration file to initiate the workflow:

```
merlin run heat_conduction_hdf.yaml
```

Unlike running locally, this does not start the jobs immediately. Rather, it queues the tasks on the server (configured for merlin), where they stay until the workflow is finished. We then let the workers start the jobs:

```
merlin run-workers heat_conduction_hdf.yaml
```
This will return command-line output similar to that of the local run. Once all the jobs are finished, we should stop the workers:

```
merlin stop-workers
```

This will create the same result directory as in the local case.
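While the workers are running, progress can be monitored from the command line. A minimal sketch, assuming the standard merlin CLI (this monitoring step is an addition, not part of the walkthrough above):

```
# show how many tasks remain queued for each step
merlin status heat_conduction_hdf.yaml
```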