This repository has been archived by the owner on Jun 30, 2023. It is now read-only.

RGCN workflows #231

Merged
merged 6 commits into from
Apr 7, 2023

Conversation

jds485 (Member)

@jds485 jds485 commented Mar 22, 2023

This PR adds:

  • Snakemake workflow file
  • 2 config files: one with spatial train/test settings and one with temporal train/test settings.
  • Helper functions called in the snakemake file in utils.py.
  • Note that other files currently show as modified because this branch is based on PR #228 (add txt files with attribute and reach names); those files were not modified in this PR and should show as unchanged after that PR is merged.

The Snakemake file relies on the jds-edits branch of river-dl on USGS-R. I used the environment.yaml file for the Python package configuration. I didn't get to try the torch.jit method yet.

In the config files:

  • Only out_dir, val_segs_f, and test_segs_f change between the 2 different train/test splits (spatial and temporal)
  • Only the x_vars_file and out_dir change for the 3 different attribute combinations (min static dynamic, static dynamic, dynamic). The example config files use dynamic attributes for the spatial train/test split, and min static dynamic for the temporal train/test split.

I can point you to the directory with input files when you're ready to review this PR. The files you will need to run the snakemake workflow are all referenced in the config file. These are available on Caldera, but the quickest option may be for me to copy these input files into SharePoint and you can run the workflow on your computer. I can also point you to the output files I got for you to compare results. The performance metrics and training log should be the same with a fixed random seed.

You can run the workflow using:
snakemake --configfile config_insal_spatial-dyn_rgcn_pytorch.yml -s Snakefile_insal_rgcn_pytorch.smk -j #
where # is the number of cores to use. I used 12, and one workflow finishes in about 30 minutes on my laptop.

Closes #230

@jds485 jds485 requested a review from jdiaz4302 March 22, 2023 21:19
utils.py
Comment on lines +163 to +164
.groupby([pd.Grouper(level=time_idx_name, freq=group_temporally),
pd.Grouper(level=spatial_idx_name)])
Collaborator

Why does this group by spatial_idx_name if it is in the condition not group_spatially?

I think I don't understand the purpose of site_based or how it's intended to be different from group_spatially because they both appear to just group by spatial_idx_name

Member Author

This is currently under review in this river-dl PR. This comment is related. Let me know if you think further explanation of the function arguments is needed.

One edit that may help is stating the number of observations used. When site_based = True, the observations are each group_temporally timestep for each reach with observations (max of n_segs*n_timesteps). When site_based = False, the observations are each group_temporally timestep with observations (max of n_timesteps).
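A minimal sketch of the two grouping modes may help here. The data and index names below are hypothetical (the actual utils.py arguments and index names may differ); the point is only that grouping by both the temporal Grouper and the spatial level yields one group per (timestep, reach), while grouping by the temporal Grouper alone pools all reaches into one group per timestep:

```python
import numpy as np
import pandas as pd

# Hypothetical minimal data: 2 reaches observed daily over 2 months.
# Index names stand in for the time_idx_name / spatial_idx_name arguments.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", "2020-02-29", freq="D"), [101, 102]],
    names=["date", "seg_id_nat"],
)
df = pd.DataFrame({"obs": np.arange(len(idx), dtype=float)}, index=idx)

# site_based=True analogue: one group per (month, reach),
# up to n_segs * n_timesteps groups.
by_site = df.groupby([pd.Grouper(level="date", freq="MS"),
                      pd.Grouper(level="seg_id_nat")]).mean()

# site_based=False analogue: one group per month, pooling all reaches,
# up to n_timesteps groups.
pooled = df.groupby(pd.Grouper(level="date", freq="MS")).mean()

print(len(by_site), len(pooled))  # 4 2  (2 months * 2 reaches vs. 2 months)
```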

Collaborator

It's still a gray area to me, but if it is being reviewed elsewhere, I will let you two handle it.

I might suggest adding a very minimal example or comment documentation to communicate this, since two of us are struggling with it.

utils.py (outdated)
@jdiaz4302 (Collaborator)

I could see it being useful to have the predictions and observations to review the reshaping and calculations being performed. Barring my questions above, though, I think the functions are relatively standard or pre-reviewed (e.g., calc_metrics).

Storing multiple versions of the same config file with different values seems a little strange. Why was that preferred over one config value, perhaps with all options documented, and changed for a given job submission?

@jdiaz4302 (Collaborator)

@jds485 I just realized that I did not review/look at the Snakemake file. I'll get back to that later. I might end up answering my own questions.

@jdiaz4302 (Collaborator)

Okay, I read through the Snakemake file. At a quick read, it looks minimally modified from the rest of river-dl and the new functions are intuitively used.

I can point you to the directory with input files when you're ready to review this PR. The files you will need to run the snakemake workflow are all referenced in the config file. These are available on Caldera

Yeah, it might be best practice for me to run this pipeline and look at some files, considering it's a whole new(ly adapted) workflow. I don't expect any problems, though.

@jdiaz4302 (Collaborator)

Some additional steps I had to do to get this working:

  • Had to download a few *.txt files from your Tallgrass space.
  • Downloaded drb_distance_matrix.npz from a different directory of the SharePoint folder.
  • I ran into a few Snakemake/pipeline errors from directories not being found because they weren't made yet. Not a blocker, but it may be worth documenting or creating empty directories prepared for the expected output.
  • ["git status"] was problematic from river-dl; I had to change it to "git status" (no brackets) to work. Not sure if it's working correctly.

Current blocker:

  • The Snakemake pipeline fails in xr.open_zarr with zarr.errors.GroupNotFoundError: group not found at path ''. This can be minimally replicated locally by running xr.open_zarr("/2_process/out/drb_attrs_PRMS.zarr", consolidated=False).

@jds485 (Member Author)

jds485 commented Apr 5, 2023

Thanks, Jeremy! I'll address the blocker first.

Snakemake pipeline failing due to xr.open_zarr. Returns zarr.errors.GroupNotFoundError: group not found at path '', can be minimally replicated locally by running xr.open_zarr("/2_process/out/drb_attrs_PRMS.zarr",consolidated=False)

Where is this line located? It works for me with a relative "./2_process..." path:
xr.open_zarr("/2_process/out/drb_attrs_PRMS.zarr",consolidated=False)
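For what it's worth, the leading slash makes that an absolute path resolved from the filesystem root, while "./" is resolved against the working directory; if the path was copied without the dot, zarr would be looking in a location that doesn't exist. A minimal sketch (the paths here are just illustrations):

```python
import os

# Illustrative paths only: a leading "/" is absolute (ignores the working
# directory), while "./" is resolved against os.getcwd().
abs_path = "/2_process/out/drb_attrs_PRMS.zarr"
rel_path = "./2_process/out/drb_attrs_PRMS.zarr"

print(os.path.isabs(abs_path), os.path.isabs(rel_path))  # True False
```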

@jds485 (Member Author)

jds485 commented Apr 5, 2023

had to download a few *.txt from your tallgrass space.
downloaded drb_distance_matrix.npz from a different directory of the sharepoint folder

I had these in 4_predict/out and 1_fetch/out. Did they not appear there in SharePoint? This might be related to the OneDrive sync issue I'm having.

I ran into a few Snakemake/pipeline errors of directories not being found because they weren't made yet. Not a blocker but it may be worth documenting or creating empty directories prepared for the expected output.

Thanks for letting me know. I don't think I had these errors when I ran it. It seemed like the directories were generated as needed. Did you have to manually create the directories?

["git status"] was problematic from river-dl, I had to change that to "git status" (no brackets) to work. Not sure if it's working correctly.

I see this problem, too. Let me file it as an issue in the river-dl repo.

@jdiaz4302 (Collaborator)

Yes, I just manually created those directories.

@jdiaz4302 (Collaborator)

The pipeline finishes all rules for me on Tallgrass.

I looked at some of the objects contained in prepped.npz and all the shapes were as I expected for RGCN usage.

  • The number of segments matched the size of the distance matrix.
  • The distance matrix looks like a normal distance matrix.
  • Time series contained in x_trn were organized as expected and followed plausible trends given their x_var labels (e.g., static land covers stayed the same, temperature was seasonal, precipitation was episodic).
  • y_trn shape was compatible.
    • Surprisingly, this object appears to be 98% missing values 🤯; is that expected?
  • If I reload the model with trained weights and prepared data, it often produces plausible predictions (e.g., below).
    • There were some duds, but I don't think that's surprising given the salinity target + missing data + first-pass RGCN workflow.

[image: decent_preds, a plot of plausible predictions]

@jds485 (Member Author)

jds485 commented Apr 7, 2023

Thanks @jdiaz4302!

To document the blocker issue: we found that it was a result of a data transfer issue, and the pipeline worked with the complete dataset.

I just manually created those directories.

Okay, thanks. I can add a line to the snakemake file that generates these directories.
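A minimal sketch of such an edit, assuming a small helper called near the top of the Snakefile (the directory names below are illustrative; the real list would come from the config's out_dir values):

```python
import os

def make_out_dirs(dirs):
    """Create each expected output directory if it doesn't already exist."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)  # no-op when the directory is already present

# Example: call once before any rule runs, e.g. with paths built from the config:
# make_out_dirs([config["out_dir"], os.path.join(config["out_dir"], "prepped")])
```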

y_trn shape was compatible

  • Surprisingly this object appears to be 98% missing values 🤯, is that expected?

Yes. For the full dataset we have about 180k observations out of 459 reaches × 365 days × 38 years ≈ 6.37M possible reach-days, so ~97% missing.
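As a quick back-of-the-envelope check of that fraction, using the rounded counts quoted above:

```python
n_obs = 180_000                # approximate observation count
n_possible = 459 * 365 * 38    # reaches * days per year * years
missing = 1 - n_obs / n_possible
print(f"{missing:.1%}")        # 97.2%
```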

I think that addresses your comments. Let me know if this is good to merge after the directory-generation edit.

@jdiaz4302 (Collaborator)

Yeah, I think it is good to merge. Explicitly:

  • My review focused on
    • If the pipeline was reproducible.
    • If your data appeared to be processed and used appropriately with RGCN.
    • An initial look at training/performance.
  • The new evaluation steps are being reviewed elsewhere, so I'm not duplicating efforts there.

@jds485 jds485 merged commit 9eed240 into USGS-R:main Apr 7, 2023
@jds485 jds485 deleted the 230-RGCN-workflows branch April 7, 2023 14:40
Successfully merging this pull request may close these issues:

  • Add files for RGCN workflow