```python
.groupby([pd.Grouper(level=time_idx_name, freq=group_temporally),
          pd.Grouper(level=spatial_idx_name)])
```
Why does this group by `spatial_idx_name` if the condition is `not group_spatially`? I think I don't understand the purpose of `site_based`, or how it's intended to differ from `group_spatially`, because they both appear to just group by `spatial_idx_name`.
This is currently under review in this river-dl PR. This comment is related. Let me know if you think further explanation of the function arguments is needed.

One edit that may help is stating the number of observations used. When `site_based = True`, the observations are each `group_temporally` timestep for each reach with observations (max of `n_segs * n_timesteps`). When `site_based = False`, the observations are each `group_temporally` timestep with observations (max of `n_timesteps`).
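To illustrate the distinction, here is a minimal, hypothetical sketch of the two behaviors. The tiny DataFrame, the `group_obs` helper, and the index names are assumptions for illustration only, not the river-dl code; only the `pd.Grouper` pattern comes from the diff above.

```python
import pandas as pd

# Toy observations: 4 daily timesteps x 2 reaches, all within one month
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", periods=4, freq="D"), ["seg1", "seg2"]],
    names=["date", "seg_id"],
)
obs = pd.DataFrame({"temp": [1.0, 2.0, 3.0, None, 5.0, 6.0, 7.0, 8.0]}, index=idx)

def group_obs(df, site_based, freq="M"):
    """Aggregate to monthly means, optionally keeping reaches separate."""
    groupers = [pd.Grouper(level="date", freq=freq)]
    if site_based:
        # one group per (period, reach) -> up to n_segs * n_timesteps rows
        groupers.append(pd.Grouper(level="seg_id"))
    # without site_based: one group per period -> up to n_timesteps rows
    return df.groupby(groupers).mean()

print(len(group_obs(obs, site_based=True)))   # 2 (one month x two reaches)
print(len(group_obs(obs, site_based=False)))  # 1 (one month)
```

With `site_based=True` each reach contributes its own aggregated observation per period; with `site_based=False` all reaches are pooled into a single observation per period.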
It's still a gray area to me, but if it is being reviewed elsewhere I will let you two handle it. I might suggest that a very minimal example or comment documentation be used to communicate this, since two of us are struggling with it.

I could see it being useful to have the predictions and observations to review the reshaping and calculations being performed. Barring my questions above, I think the functions are relatively standard or pre-reviewed (e.g., …).

Storing multiple versions of the same config file with different values seems a little strange. Why was that preferred over one config value, perhaps with all options documented, and changed for a given job submission?
@jds485 I just realized that I did not review/look at the Snakemake file. I'll get back to that later. I might end up answering my own questions.
Okay, I read through the Snakemake file. At a quick read, it looks minimally modified from the rest of …

Yeah, it might be best practice for me to run this pipeline and look at some files, considering it's a whole new(ly adapted) workflow. I don't expect any problems though.
Some additional steps I had to do to get this working:

Current blocker:
Thanks, Jeremy! I'll address the blocker first.

Where is this line located? It works for me with a …
I had these in …

Thanks for letting me know. I don't think I had these errors when I ran it. It seemed like the directories were generated as needed. Did you have to manually create the directories?
I see this problem, too. Let me file it as an issue in the river-dl repo.

Yes, I just manually created those directories.

The pipeline finishes all rules for me on Tallgrass. I looked at some of the objects contained in …
Thanks @jdiaz4302! To document the blocker issue: we found that it resulted from a data transfer problem, and the pipeline worked with the complete dataset.
Okay, thanks. I can add a line to the snakemake file that generates these directories.
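Since a Snakemake file is ordinary Python at the top level, one minimal way to do this is to create the output directories near the top of the Snakefile. This is a sketch; the directory names below are hypothetical stand-ins for whatever the config file specifies (e.g., `config["out_dir"]`):

```python
import os

def ensure_dirs(dirs):
    """Create each directory if it doesn't already exist (idempotent)."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)  # no error if the directory exists

# Hypothetical names -- in the Snakefile these would come from the config
ensure_dirs(["out_dir/models", "out_dir/metrics"])
```

Alternatively, Snakemake creates the parent directories of declared `output:` files automatically, so declaring the files as rule outputs can remove the need for manual creation.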
Yes, for the full dataset we have about 180k observations / (459 reaches × 365 days × 38 years), so roughly 97% of reach-days have no observations.

I think that addresses your comments. Let me know if this is good to merge after the directory generation edit.
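As a quick arithmetic check of that fraction (using the approximate 180k observation count quoted above):

```python
possible = 459 * 365 * 38  # possible reach-day observations
observed = 180_000         # approximate observation count
frac_observed = observed / possible
print(f"{frac_observed:.1%} observed, {1 - frac_observed:.1%} missing")
# roughly 2.8% observed, 97.2% missing
```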
Yeah, I think it is good to merge. Explicitly:
This PR adds:

The Snakemake file relies on the `jds-edits` branch of `river-dl` on USGS-R. I used the `environment.yaml` file for the Python package configuration. I didn't get to try the torch.jit method yet.

In the config files:
- `out_dir`, `val_segs_f`, and `test_segs_f` change between the 2 different train/test splits (spatial and temporal)
- `x_vars_file` and `out_dir` change for the 3 different attribute combinations (min static dynamic, static dynamic, dynamic)

The example config files use dynamic attributes for the spatial train/test split, and min static dynamic for the temporal train/test split.

I can point you to the directory with input files when you're ready to review this PR. The files you will need to run the snakemake workflow are all referenced in the config file. These are available on Caldera, but the quickest option may be for me to copy these input files into SharePoint so you can run the workflow on your computer. I can also point you to the output files I got for you to compare results. The performance metrics and training log should be the same with a fixed random seed.
You can run the workflow using:

```shell
snakemake --configfile config_insal_spatial-dyn_rgcn_pytorch.yml -s Snakefile_insal_rgcn_pytorch.smk -j #
```

where `#` is the number of cores to use. I used 12, and one workflow finishes in about 30 mins on my laptop.
Closes #230