This repository has been archived by the owner on Jun 30, 2023. It is now read-only.

RGCN workflows #231

Merged
merged 6 commits into from
Apr 7, 2023

Conversation

jds485 (Member)

@jds485 jds485 commented Mar 22, 2023

This PR adds:

  • Snakemake workflow file
  • 2 config files: one with spatial train/test settings and one with temporal train/test settings.
  • Helper functions called in the snakemake file in utils.py.
  • Note that other files currently show as modified because this branch is based on PR #228 (add txt files with attribute and reach names); those files were not modified in this PR and should show as unchanged after that PR is merged.

The Snakemake file relies on the jds-edits branch of river-dl on USGS-R. I used the environment.yaml file for the Python package configuration. I didn't get to try the torch.jit method yet.

In the config files:

  • Only out_dir, val_segs_f, and test_segs_f change between the 2 different train/test splits (spatial and temporal)
  • Only the x_vars_file and out_dir change for the 3 different attribute combinations (min static dynamic, static dynamic, dynamic). The example config files use dynamic attributes for the spatial train/test split, and min static dynamic for the temporal train/test split.

I can point you to the directory with input files when you're ready to review this PR. The files you will need to run the snakemake workflow are all referenced in the config file. These are available on Caldera, but the quickest option may be for me to copy these input files into SharePoint and you can run the workflow on your computer. I can also point you to the output files I got for you to compare results. The performance metrics and training log should be the same with a fixed random seed.

You can run the workflow using:
snakemake --configfile config_insal_spatial-dyn_rgcn_pytorch.yml -s Snakefile_insal_rgcn_pytorch.smk -j #
where # is the number of cores to use. I used 12, and one workflow finishes in about 30 minutes on my laptop.

Closes #230

@jds485 jds485 requested a review from jdiaz4302 March 22, 2023 21:19
utils.py
Comment on lines +163 to +164
.groupby([pd.Grouper(level=time_idx_name, freq=group_temporally),
pd.Grouper(level=spatial_idx_name)])
Collaborator

Why does this group by spatial_idx_name if it is in the condition not group_spatially?

I think I don't understand the purpose of site_based or how it's intended to be different from group_spatially because they both appear to just group by spatial_idx_name

Member Author

This is currently under review in this river-dl PR. This comment is related. Let me know if you think further explanation of the function arguments is needed.

One edit that may help is stating the number of observations used. When site_based = True, the observations are each group_temporally timestep for each reach with observations (max of n_segs*n_timesteps). When site_based = False, the observations are each group_temporally timestep with observations (max of n_timesteps).
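A minimal sketch of the two grouping modes may help here. The data and index names below are hypothetical (the actual utils.py arguments and index names may differ); the point is only that grouping by both the temporal Grouper and the spatial level yields one group per (timestep, reach), while grouping by the temporal Grouper alone pools all reaches into one group per timestep:

```python
import numpy as np
import pandas as pd

# Hypothetical minimal data: 2 reaches observed daily over 2 months.
# Index names stand in for the time_idx_name / spatial_idx_name arguments.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", "2020-02-29", freq="D"), [101, 102]],
    names=["date", "seg_id_nat"],
)
df = pd.DataFrame({"obs": np.arange(len(idx), dtype=float)}, index=idx)

# site_based=True analogue: one group per (month, reach),
# up to n_segs * n_timesteps groups.
by_site = df.groupby([pd.Grouper(level="date", freq="MS"),
                      pd.Grouper(level="seg_id_nat")]).mean()

# site_based=False analogue: one group per month, pooling all reaches,
# up to n_timesteps groups.
pooled = df.groupby(pd.Grouper(level="date", freq="MS")).mean()

print(len(by_site), len(pooled))  # 4 2  (2 months * 2 reaches vs. 2 months)
```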

Collaborator

It's still a gray area to me, but if it is being reviewed elsewhere, I will let you two handle it.

I might suggest adding a very minimal example or comment documentation to communicate this, since two of us are struggling with it.

utils.py (outdated)
@jdiaz4302 (Collaborator)

I could see it being useful to have the predictions and observations to review the reshaping and calculations being performed. Barring my questions above, though, I think the functions are relatively standard or pre-reviewed (e.g., calc_metrics).

Storing multiple versions of the same config file with different values seems a little strange. Why was that preferred over one config value, perhaps with all options documented, and changed for a given job submission?

@jdiaz4302 (Collaborator)

@jds485 I just realized that I did not review/look at the Snakemake file. I'll get back to that later. I might end up answering my own questions.

@jdiaz4302 (Collaborator)

Okay, I read through the Snakemake file. At a quick read, it looks minimally modified from the rest of river-dl and the new functions are intuitively used.

I can point you to the directory with input files when you're ready to review this PR. The files you will need to run the snakemake workflow are all referenced in the config file. These are available on Caldera

Yeah, it might be best practice for me to run this pipeline and look at some files, considering it's a whole new(ly adapted) workflow. I don't expect any problems, though.

@jdiaz4302 (Collaborator)

Some additional steps I had to do to get this working:

  • Had to download a few *.txt files from your Tallgrass space.
  • Downloaded drb_distance_matrix.npz from a different directory of the SharePoint folder.
  • I ran into a few Snakemake/pipeline errors from directories not being found because they weren't made yet. Not a blocker, but it may be worth documenting or creating empty directories prepared for the expected output.
  • ["git status"] was problematic from river-dl; I had to change it to "git status" (no brackets) to work. Not sure if it's working correctly.

Current blocker:

  • The Snakemake pipeline fails in xr.open_zarr with zarr.errors.GroupNotFoundError: group not found at path ''. This can be minimally replicated locally by running xr.open_zarr("/2_process/out/drb_attrs_PRMS.zarr", consolidated=False).

@jds485 (Member Author)

jds485 commented Apr 5, 2023

Thanks, Jeremy! I'll address the blocker first.

Snakemake pipeline failing due to xr.open_zarr. Returns zarr.errors.GroupNotFoundError: group not found at path '', can be minimally replicated locally by running xr.open_zarr("/2_process/out/drb_attrs_PRMS.zarr",consolidated=False)

Where is this line located? It works for me with a relative "./2_process..." path:
xr.open_zarr("/2_process/out/drb_attrs_PRMS.zarr",consolidated=False)
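For what it's worth, the leading slash makes that an absolute path resolved from the filesystem root, while "./" is resolved against the working directory; if the path was copied without the dot, zarr would be looking in a location that doesn't exist. A minimal sketch (the paths here are just illustrations):

```python
import os

# Illustrative paths only: a leading "/" is absolute (ignores the working
# directory), while "./" is resolved against os.getcwd().
abs_path = "/2_process/out/drb_attrs_PRMS.zarr"
rel_path = "./2_process/out/drb_attrs_PRMS.zarr"

print(os.path.isabs(abs_path), os.path.isabs(rel_path))  # True False
```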

@jds485 (Member Author)

jds485 commented Apr 5, 2023

had to download a few *.txt from your tallgrass space.
downloaded drb_distance_matrix.npz from a different directory of the sharepoint folder

I had these in 4_predict/out and 1_fetch/out. Did they not appear there in SharePoint? This might be related to the OneDrive sync issue I'm having.

I ran into a few Snakemake/pipeline errors of directories not being found because they weren't made yet. Not a blocker but it may be worth documenting or creating empty directories prepared for the expected output.

Thanks for letting me know. I don't think I had these errors when I ran it. It seemed like the directories were generated as needed. Did you have to manually create the directories?

["git status"] was problematic from river-dl, I had to change that to "git status" (no brackets) to work. Not sure if it's working correctly.

I see this problem, too. Let me file it as an issue in the river-dl repo.

@jdiaz4302 (Collaborator)

Yes, I just manually created those directories.

@jdiaz4302 (Collaborator)

The pipeline finishes all rules for me on Tallgrass.

I looked at some of the objects contained in prepped.npz and all the shapes were as I expected for RGCN usage.

  • The number of segments matched the size of the distance matrix.
  • The distance matrix looks like a normal distance matrix.
  • Time series contained in x_trn were organized as expected and followed plausible trends given their x_var labels (e.g., static land covers stayed the same, temperature was seasonal, precipitation was episodic).
  • y_trn shape was compatible.
    • Surprisingly, this object appears to be 98% missing values 🤯; is that expected?
  • If I reload the model with trained weights and prepared data, it often produces plausible predictions (e.g., below).
    • There were some duds, but I don't think that's surprising given the salinity target + missing data + first-pass RGCN workflow.

[image: decent_preds, a plot of plausible predictions]

@jds485 (Member Author)

jds485 commented Apr 7, 2023

Thanks @jdiaz4302!

To document the blocker issue: we found that it was a result of a data transfer issue, and the pipeline worked with the complete dataset.

I just manually created those directories.

Okay, thanks. I can add a line to the snakemake file that generates these directories.
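A minimal sketch of such an edit, assuming a small helper called near the top of the Snakefile (the directory names below are illustrative; the real list would come from the config's out_dir values):

```python
import os

def make_out_dirs(dirs):
    """Create each expected output directory if it doesn't already exist."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)  # no-op when the directory is already present

# Example: call once before any rule runs, e.g. with paths built from the config:
# make_out_dirs([config["out_dir"], os.path.join(config["out_dir"], "prepped")])
```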

y_trn shape was compatible

  • Surprisingly this object appears to be 98% missing values 🤯, is that expected?

Yes. For the full dataset we have about 180k observations out of 459 reaches × 365 days × 38 years ≈ 6.37M possible reach-days, so ~97% missing.
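As a quick back-of-the-envelope check of that fraction, using the rounded counts quoted above:

```python
n_obs = 180_000                # approximate observation count
n_possible = 459 * 365 * 38    # reaches * days per year * years
missing = 1 - n_obs / n_possible
print(f"{missing:.1%}")        # 97.2%
```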

I think that addresses your comments. Let me know if this is good to merge after the directory-generation edit.

@jdiaz4302 (Collaborator)

Yeah, I think it is good to merge. Explicitly:

  • My review focused on
    • If the pipeline was reproducible.
    • If your data appeared to be processed and used appropriately with RGCN.
    • An initial look at training/performance.
  • The new evaluation steps are being reviewed elsewhere, so I'm not duplicating efforts there.

@jds485 jds485 merged commit 9eed240 into USGS-R:main Apr 7, 2023
@jds485 jds485 deleted the 230-RGCN-workflows branch April 7, 2023 14:40
Successfully merging this pull request may close these issues:

  • Add files for RGCN workflow