This repository has been archived by the owner on Jun 2, 2023. It is now read-only.

Edits to spatial train/val/test, additional performance metrics #211

Merged: jds485 merged 24 commits into main from jds-edits on Apr 26, 2023

Conversation

@jds485 (Member) commented Mar 10, 2023:

  • I found that validation segments were appearing in the test set, and training segments were appearing in validation. I edited some if/else statements to ensure that only test segments are in testing and only validation segments are in validation. I'm not sure whether the previous implementation was as intended or a bug.
  • I added biweekly, monthly, and yearly performance metrics, and a by-year summary of daily data. I also added log-KGE, and recommend 21 as the minimum number of observations needed to compute a performance metric; for the 10th and 90th percentile metrics, this gives 2 samples each.
  • I added args to preprocessing to filter the x_data by an earliest and/or latest date. With these arguments, the xarray of data could contain NAs/NaNs before or after these dates, respectively. I think this is useful when there are lagged attributes that would have NAs in the lag before the first non-NA point (see the sketch below).
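For illustration, here is a minimal sketch of the kind of date filtering these new args enable, assuming an xarray dataset of driver data. The function and argument names (filter_x_dates, earliest_time, latest_time) are illustrative, not necessarily the PR's actual signature:

    # Hedged sketch: trim an xarray dataset of driver data to a date window so
    # that NaN-padded lag periods fall outside the prepped data.
    import xarray as xr

    def filter_x_dates(x_data, time_idx_name="date",
                       earliest_time=None, latest_time=None):
        """Trim x_data to [earliest_time, latest_time]; either bound may be None."""
        if earliest_time is not None:
            x_data = x_data.sel({time_idx_name: slice(earliest_time, None)})
        if latest_time is not None:
            x_data = x_data.sel({time_idx_name: slice(None, latest_time)})
        return x_data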

@janetrbarclay (Collaborator) left a comment:

I left some comments throughout this: a couple for clarification, one where I think your suggested updates would change the functionality of the current code, and at least one or two places where I suggested getting broader input on how others are using the code. Let me know if you want to discuss any of these further.

river_dl/evaluate.py:
        data = data[~data[spatial_idx_name].isin(train_sites)]
    if val_sites and partition == 'tst':
        data = data[~data[spatial_idx_name].isin(val_sites)]
    if train_sites and partition == 'trn':
Collaborator:

I agree that there is an omission in the original code and that the following code should be added to correct that:

if train_sites and partition=='val':
    data = data[~data[spatial_idx_name].isin(train_sites)]

If I'm understanding them correctly, I think the suggested edits change the functionality of this section. As I read the original code, it appears that train_sites (as well as test_sites and val_sites) were sites that only appeared in that partition, but they weren't necessarily the only sites in that partition. In the revised code, it appears that if train_sites is specified, it will use only those sites in the training evaluations (and remove those sites from the test and validation partition).

If the intention is to change the functionality of train_sites, etc, then it's probably good to have a broader conversation. I am not using that option in my projects currently, but I don't know if others are.
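To make the two behaviors concrete, here is an illustrative pandas snippet (toy data; not code from the PR) contrasting the two semantics:

    import pandas as pd

    data = pd.DataFrame({"seg_id_nat": [1, 1, 2, 3],
                         "obs": [0.1, 0.2, 0.3, 0.4]})
    train_sites = [1, 2]

    # Original semantics: train_sites appear only in 'trn' (they are removed
    # from the other partitions), but 'trn' may still contain other sites.
    val_data = data[~data["seg_id_nat"].isin(train_sites)]  # keeps site 3 only

    # Revised semantics (as read above): 'trn' is restricted to exactly
    # train_sites.
    trn_data = data[data["seg_id_nat"].isin(train_sites)]   # sites 1 and 2 only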

Member Author:

> As I read the original code, it appears that train_sites (as well as test_sites and val_sites) were sites that only appeared in that partition, but they weren't necessarily the only sites in that partition.

Yes, I agree. I wasn't sure if this was intentional. I could add another explicit_spatial_split parameter here to allow for the previous functionality when False and this new functionality when True. I'll hold off on that until I receive feedback from others.

Collaborator:

Sounds good.

Collaborator:

Oh wow. Yeah. This was definitely a bug! 😮 Thanks for catching this.

Member Author:

@janetrbarclay: in a comment below, Jeff suggests we assume that sites within the train/val/test are the only sites in those partitions. That's also what I would expect. Do you know of anyone who is relying on the previous method wherein sites that are not within train_sites/val_sites/test_sites could be in the train/val/test partitions?

river_dl/evaluate.py:
            .apply(calc_metrics)
            .reset_index()
        )
    elif group == "biweekly":
Collaborator:

I think the addition of biweekly and yearly options for the metrics is great.

If I'm reading this correctly, "biweekly", "monthly", and "yearly" all also use "seg_id_nat". For consistency with the other grouped metrics, it seems good to have that included in the group list (so group = ['seg_id_nat','biweekly']).

(And as an aside, I'm noticing we should remove the hardcoded reference to seg_id_nat and replace it with spatial_idx_name. I think it's just the 3 references in this section. Would you want to fix that in this PR since you're already editing this section?)

Also, without running it (which I haven't done), I'm not sure how monthly and yearly are different from ['seg_id_nat','month'] and ['seg_id_nat','year'], since they are both grouping on the same things.

Member Author (@jds485, Mar 14, 2023):

> I'm not sure how monthly and yearly are different from ['seg_id_nat','month'] and ['seg_id_nat','year'] since they are both grouping on the same things

The biweekly, monthly, and yearly options resample the daily timeseries to those timesteps by taking the sum of the data within each period (only for the days with observations). I'm not sure that sum is the best option and am open to other suggestions.

The resulting performance metrics are computed over all reaches, not by reach as with the ['seg_id_nat','time'] options, so I can add a group option that reports these metrics by reach. A sketch of the two-step aggregation follows.
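Roughly, the two-step computation being described looks like this (a sketch assuming a DataFrame with a (time, reach) MultiIndex and the repo's calc_metrics function; variable names are illustrative):

    import pandas as pd

    # step 1: daily -> monthly series per reach, summing only observed days
    monthly = (data.dropna()
               .groupby([pd.Grouper(level=time_idx_name, freq="M"),
                         pd.Grouper(level=spatial_idx_name)])
               .sum())

    # step 2: metrics over all reaches at once (one set of metrics) ...
    metrics_all = calc_metrics(monthly.reset_index())
    # ... or per reach, as proposed above
    metrics_by_reach = (monthly.reset_index()
                        .groupby(spatial_idx_name)
                        .apply(calc_metrics)
                        .reset_index())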

Member Author:

I agree with removing the "seg_id_nat" comparison to be more generic, but that will affect snakemake workflows. For example, the workflow examples all have a function that defines group using seg_id_nat. It might be better to address this problem in a separate issue.

Collaborator:

> The resulting performance metrics are computed over all reaches, not by reach as with the ['seg_id_nat','time'] options, so I can add the group option that reports these metrics by reach

I'm not super familiar with the pandas grouper (so maybe that's the source of my confusion), but both monthly and yearly use two pandas groupers, one on time_idx_name and one on spatial_idx_name, right? So are you summing by reach and then calculating metrics across all the reaches?

Member Author:

> So are you summing by reach and then calculating metrics across all the reaches?

Yes, that's right.

Member Author (@jds485, Mar 14, 2023):

> I can add the group option that reports these metrics by reach

Edit: I added metrics for reach-biweekly, reach-monthly, and reach-yearly timeseries.

We could also have reach-biweekly-month (summarize the biweekly timeseries by month), reach-biweekly-year, and reach-monthly-year. reach-biweekly-time would require an additional function to define a biweekly index for Python datetime objects; a possible sketch follows.
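One possible biweekly index for datetimes (an assumption for illustration, not code from this PR) pairs consecutive ISO weeks:

    import pandas as pd

    def biweekly_index(dates: pd.DatetimeIndex) -> pd.Series:
        # map each date to "<ISO year>-<biweek>", where weeks 1-2 -> biweek 0,
        # weeks 3-4 -> biweek 1, etc.; year boundaries are handled only
        # approximately here
        iso = dates.isocalendar()
        return iso["year"].astype(str) + "-" + ((iso["week"] - 1) // 2).astype(str)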

@janetrbarclay self-requested a review on March 13, 2023 at 19:30.
@janetrbarclay (Collaborator) left a comment:

Grr. I intended to click "request changes" before. Sorry for the change!!

@jds485 (Member Author) commented Mar 14, 2023:

Thanks for the review, @janetrbarclay! I think I addressed all of your comments; a few will require feedback from others before editing/reverting.

@janetrbarclay (Collaborator):

> Thanks for the review, @janetrbarclay! I think I addressed all of your comments; a few will require feedback from others before editing/reverting.

Agreed! Let me know when you want me to glance at those parts again and I'll be happy to approve them.

@jsadler2 (Collaborator) left a comment:

@jds485 - thanks for catching these things that have been slipping through the cracks! I have a few comments/suggestions that I think you/we should consider before making this merge.

river_dl/evaluate.py:

@@ -268,6 +335,99 @@ def partition_metrics(
            .apply(calc_metrics)
            .reset_index()
        )
    elif group == "year":
Collaborator:

This is getting to be quite verbose. I suggest we split the group argument into two, maybe group_spatially and group_temporally. group_spatially would be just a boolean; group_temporally would be a str.

Then the function could be something like:

if not group_spatially and not group_temporally:
    metrics = calc_metrics(data)
    # need to convert to dataframe and transpose so it looks like the others
    metrics = pd.DataFrame(metrics).T
elif group_spatially and not group_temporally:
    metrics = data.groupby(spatial_idx_name).apply(calc_metrics).reset_index()
elif not group_spatially and group_temporally:
    # note: pd.Grouper takes level= (or key=), not index=
    metrics = (data.groupby(pd.Grouper(level=time_idx_name, freq=group_temporally))
               .apply(calc_metrics)
               .reset_index())
elif group_spatially and group_temporally:
    metrics = (data.groupby([pd.Grouper(level=time_idx_name, freq=group_temporally),
                             pd.Grouper(level=spatial_idx_name)])
               .apply(calc_metrics)
               .reset_index())

I think that should work. We'd just have to document how the group_temporally argument needs to work.
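For reference, group_temporally here would be a standard pandas offset alias rather than anything defined by this PR; e.g.:

    # standard pandas offset aliases that could serve as group_temporally:
    # "2W" -> biweekly, "M" -> month end, "Y" -> year end
    metrics = (data.groupby(pd.Grouper(level=time_idx_name, freq="2W"))
               .apply(calc_metrics)
               .reset_index())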

Collaborator:

Definitely a bigger change, but I think it would be worth trying. It would also require propagating the change all the way up, including any Snakefiles that are using this function.

Collaborator:

Would also resolve #212

Member Author:

Thanks, that is much cleaner. I think one more argument would be needed to specify how to do the temporal aggregation. I think what you've programmed would compute metrics for the native timestep only (daily); I used a sum of the daily data to get biweekly, monthly, and yearly timesteps and computed metrics for those. Let me try this edit because it will make the code much cleaner.

Member Author:

I addressed this change in the most recent commit. I needed 4 group args to replicate the previous functionality. I think it's now more generic to different aggregation timesteps, but it's more difficult to understand how to apply the 4 args to compute the desired metrics. For reference, here is the function I used in the snakemake file to assign the 4 new args to this function to compute different metrics for different groups; see the description of the 4 args in the function docstring.

# Order in the list is:
# group_spatially (bool), group_temporally (False or timestep to use),
# sum_aggregation (bool), site_based (bool)
def get_grp_arg(wildcards):
    if wildcards.metric_type == 'overall':
        return [False, False, False, False]
    elif wildcards.metric_type == 'month':
        return [False, 'M', False, False]
    elif wildcards.metric_type == 'reach':
        return [True, False, False, False]
    elif wildcards.metric_type == 'month_reach':
        return [True, 'M', False, False]
    elif wildcards.metric_type == 'monthly_site_based':
        return [False, 'M', True, True]
    elif wildcards.metric_type == 'monthly_all_sites':
        return [False, 'M', True, False]
    elif wildcards.metric_type == 'monthly_reach':
        return [True, 'M', True, False]
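For context, a params function like this is typically wired into a Snakemake rule roughly as follows (a hypothetical sketch; the rule name, file paths, and the exact partition_metrics keyword names are assumptions, not taken from this PR):

    rule calc_metrics:
        input: "results/preds_{metric_type}.feather"
        output: "results/metrics_{metric_type}.csv"
        params: grp_args = get_grp_arg  # Snakemake passes wildcards to the function
        run:
            # assumed keyword names matching the 4 new args described above
            partition_metrics(input[0], output[0],
                              group_spatially=params.grp_args[0],
                              group_temporally=params.grp_args[1],
                              sum_aggregation=params.grp_args[2],
                              site_based=params.grp_args[3])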

Comment on lines 390 to 404
elif group == ["seg_id_nat", "monthly"]:
    # filter the data to remove NaNs before computing the sum
    # so that the same days are being summed in each period.
    data_calc = (data_multiind.dropna()
                 .groupby(
                     [pd.Grouper(level=time_idx_name, freq='M'),
                      pd.Grouper(level=spatial_idx_name)])
                 .sum()
                 )
    data_calc = data_calc.reset_index()
    metrics = (data_calc
               .groupby(spatial_idx_name)
               .apply(calc_metrics)
               .reset_index()
               )
Collaborator:

Not sure why you would need to break it into two steps. Did you try this (below) and it didn't work?

Suggested change: replace

    elif group == ["seg_id_nat", "monthly"]:
        # filter the data to remove NaNs before computing the sum
        # so that the same days are being summed in each period.
        data_calc = (data_multiind.dropna()
                     .groupby(
                         [pd.Grouper(level=time_idx_name, freq='M'),
                          pd.Grouper(level=spatial_idx_name)])
                     .sum()
                     )
        data_calc = data_calc.reset_index()
        metrics = (data_calc
                   .groupby(spatial_idx_name)
                   .apply(calc_metrics)
                   .reset_index()
                   )

with:

    elif group == ["seg_id_nat", "monthly"]:
        # group by month and reach in one step and compute metrics directly
        metrics = (data_multiind.dropna()
                   .groupby(
                       [pd.Grouper(level=time_idx_name, freq='M'),
                        pd.Grouper(level=spatial_idx_name)])
                   .apply(calc_metrics)
                   .reset_index()
                   )

Member Author:

No additional chained (.) functions were allowed after .sum(), so I broke it into several steps.

Also, noting that the original and suggested code compute metrics for different groups. The original code creates a monthly timeseries for each reach and computes reach-based metrics using the monthly timeseries. The suggested code uses data in the native timestep (e.g., daily) to compute metrics for each month-reach. Both of these are options in the latest commit.

Collaborator:

I guess my question was, why do you need to do .sum() in the first place?

Member Author:

I used .sum() to convert the daily timeseries to a biweekly, monthly, and yearly timeseries for each reach. I think this is analogous to the overall metrics computed for the native timestep, which is also indexed by date-reach. Performance metrics for these timeseries indicate how well the model is able to predict at those timesteps.

Collaborator:

Okay. A little slow, but I think I got it now. So it's a two-step process: first you aggregate over the time-space dimensions (e.g., site-biweekly), and then you aggregate over space (e.g., site). So you end up with just one number per site, but it's looking at bigger windows of time (e.g., two weeks instead of every day).

Am I getting that right?

Collaborator:

If that's right, why did you choose sum instead of mean? It seems like mean would be better.

For example, if I do an RMSE on the biweekly sums, I get:

    site_id
    01466500      8.862246
    01472104     23.094643
    01472119     23.529100
    014721254    20.738631
    014721259    16.497109
    01473499     17.143101
    01473500     17.544453
    01473675     17.654814
    01473780     15.881052
    01474500     12.225832
    01480617     35.625263
    01480870     36.393360
    01481000     24.799420
    01481500     26.861900

which is way high.

If I do it on the means:

    site_id
    01466500     0.633631
    01472104     1.651097
    01472119     1.682208
    014721254    1.481826
    014721259    1.179091
    01473499     1.225882
    01473500     1.254535
    01473675     1.262302
    01473780     1.136314
    01474500     0.874386
    01480617     2.546138
    01480870     2.601188
    01481000     1.772298
    01481500     1.919950

A lot lower.

Is that what you'd expect?

Member Author (@jds485, Mar 29, 2023):

> So it's a two-step process because first you are aggregating by the time-space dimensions (e.g., site-biweekly) and then you are aggregating by the space (e.g., site). So you end up with just one number per site, but it's looking at bigger windows of time

Yes, that's right.

> why did you choose sum instead of mean?

Good point: the daily values are daily averages, not the total flow in a day. For some reason I was thinking of total flow, salinity, etc. I think the sum metrics will also be inflated compared to the mean. Here's an example of biweekly metrics for specific conductivity. Log-RMSE is the same because log(sum_pred/n) - log(sum_obs/n) = log(sum_pred) - log(sum_obs); a quick check follows the output below.

    metrics_sum
    Out[126]:
    rmse               10407.172144
    nse                    0.946307
    rmse_top10         28507.780134
    rmse_bot10          1128.871582
    rmse_logged            0.162555
    mean_bias          -3456.110397
    mean_bias_top10   -11188.212460
    mean_bias_bot10     -618.612363
    nse_top10              0.793622
    nse_bot10              0.803559
    nse_logged             0.963944
    kge                    0.899966
    kge_logged             0.977389
    kge_top10              0.881643
    kge_bot10              0.875451
    n_obs                966.000000
    dtype: float64

    metrics_mean
    Out[127]:
    rmse                47.075655
    nse                  0.649618
    rmse_top10         121.434744
    rmse_bot10          21.972157
    rmse_logged          0.162555
    mean_bias          -21.274822
    mean_bias_top10    -88.179638
    mean_bias_bot10    -15.051474
    nse_top10           -1.496829
    nse_bot10            0.258352
    nse_logged           0.698174
    kge                  0.756805
    kge_logged           0.887974
    kge_top10            0.220643
    kge_bot10            0.732286
    n_obs              966.000000
    dtype: float64
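A quick numpy check of the log-RMSE invariance noted above (illustrative; dividing both summed series by the same n shifts both logs by log n, leaving their differences unchanged):

    import numpy as np

    n = 14  # e.g., days per biweekly period
    pred_sum, obs_sum = np.random.rand(2, 100) * 100 + 1
    diff_sum = np.log(pred_sum) - np.log(obs_sum)
    diff_mean = np.log(pred_sum / n) - np.log(obs_sum / n)
    assert np.allclose(diff_sum, diff_mean)  # identical log residuals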

@@ -596,6 +596,7 @@ def prep_y_data(
     test_end_date=None,
     val_sites=None,
     test_sites=None,
+    explicit_spatial_partition=False,
Collaborator:

I tend to think that we should just assume that it is explicit. If I specify train/val/test sites, I expect those to be the only sites in the partition.

Member Author:

I agree. I added this argument to allow for the previous implementation. It sounds like the previous implementation was not intentional, so I'm happy to remove the explicit_spatial_partition arg here

Collaborator:

Okay. I just realized something when I was working on the DO project that may change this conversation.

What if you have sites that are in the training set for the training time period, but have data in the validation time period?

As the code is currently written, the data at training sites in the validation time period will be included in the validation set.

With these suggested changes, even if a training site has data in the validation time period, that data will not be used in the evaluation.

I think that's why I wrote the code as it is.

Member Author:

Okay, that makes sense. I think that's similar to @janetrbarclay's point here. Is this also why evaluate.py did not remove training sites from validation? And would you also want validation sites to be included within the test set for the same reason? Those sites are currently not removed from the test set.

Here are some edits I can make:

  • Keep the explicit_spatial_partition option in this function (prep_y_data) to allow for the functionality you're describing.
  • Add an explicit_spatial_partition option to the partition_metrics function in evaluate.py.
  • When explicit_spatial_partition = False (the current functionality), add a check that a temporal split is specified for train/val and val/test (whichever is applicable) so that data are not used in multiple splits.
  • Give the prep_y_data and partition_metrics functions the same train_sites, val_sites, and test_sites args, and the same functionality for those args; they are currently different.

Let me know what you think.

Collaborator:

Jumping back into this after not thinking about it for a while, I'm wanting to make sure I have a clear understanding in my head of the options.

It seems we can have

  1. trn_sites (and val_sites and tst_sites) as the only sites for the respective partitions (i.e., remove all other sites from those partitions, but those sites can appear in other partitions), or
  2. trn_sites (and val_sites and tst_sites) can appear only in those partitions (i.e., remove those sites from all other partitions, but other sites can appear in those partitions), or
  3. trn_sites (and val_sites and tst_sites) are the only sites for those partitions and they only appear there (i.e., remove those sites from other partitions and remove other sites from those partitions).

Right?

I'm not quite sure which option explicit_spatial_partition == True or False maps to.

Also, would we ever want different behavior among the partitions? (I.e., trn_sites can appear in the test partition, but tst_sites can't appear in the training partition.) It would get messy and complicated and likely isn't worth adding unless there's a clear need.

I like the idea of a check to make sure that data are not used in multiple partitions. Maybe that's an addition that is irrespective of the partition approach that's used? (Just a final step in prepping the data.)

Collaborator:

> option 1 would be the same as option 3 if all 3 partitions were specified (right?)

Yes, that's right (though I wasn't thinking of that yesterday). If both of these options remain, maybe a note goes in the comments pointing this out?

I'm trying to think about how I would use this. My inclination is that if I'm specifying training sites, then I want those to be the only training sites (otherwise what's the point of specifying sites?). So I guess I can see uses for options 1 and 3, but I'm not sure of the point of option 2.

What do you think of keeping those options, with explicit_spatial_partition as the flag that determines which one?

Member Author:

Sounds good to me! That's also how I would expect a function to use provided site partitions.

Collaborator:

And if someone wanted something like the prior functionality, where some sites get removed from other partitions and some don't, they can fully specify all the partitions and list things how they want. (This means that for option 3 a specified list of sites for a partition should override removing sites specified for other partitions; i.e., if trn_sites is specified, those sites are removed from the validation partition unless val_sites is also specified, in which case that takes precedence.)

Member Author:

Sure, I can work that option in. I'll plan to get to this next week.

Collaborator:

Thanks for cleaning this up for us!

@@ -1150,3 +1187,32 @@ def read_exclude_segs_file(exclude_file):
    with open(exclude_file, "r") as s:
        d = yaml.safe_load(s)
    return [val for key, val in d.items()]

def check_partitions(data):
Member Author (@jds485, Apr 18, 2023):

@janetrbarclay Is this what you were envisioning for the partition check? y_obs uses the same `ids_<partition>` and `times_<partition>`, so I think only the x data dict has to be checked.

But maybe each dataset has to be filtered to remove the site-days with NaN in y_obs, because the sequences might overlap even though the data are NaN.

I decided to add this NaN filter as well; it seems like a stricter check. A sketch of the idea is below.
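For reference, a rough sketch of the kind of overlap check being discussed (an assumed implementation; key names like ids_trn/times_trn/y_obs_trn follow the prepped-data dict convention mentioned above, and the PR's actual check_partitions may differ):

    import numpy as np

    def check_partitions(data):
        def obs_keys(part):
            ids = data[f"ids_{part}"].flatten()
            times = data[f"times_{part}"].flatten()
            y = data[f"y_obs_{part}"].flatten()
            keep = ~np.isnan(y)  # ignore NaN-masked site-days
            return set(zip(ids[keep], times[keep]))

        trn, val, tst = (obs_keys(p) for p in ("trn", "val", "tst"))
        assert not (trn & val), "train/val partitions overlap"
        assert not (trn & tst), "train/test partitions overlap"
        assert not (val & tst), "val/test partitions overlap"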

Collaborator:

Yes, though I hadn't thought through the implementation details, this is the functionality I was picturing. My only suggestion has to do with pretraining: I think we're not as worried about overlapping partitions in the pretraining data. It seems like good practice to have this check as the default, but maybe either add an option to not check the pretraining data and/or print a warning rather than fail if there are overlaps?

Member Author:

I can change the hard-coded True/False to a user-specified value.
But why would overlapping partitions be okay for pretraining?

Collaborator:

I would definitely keep checking as the default, but maybe allow an override? I'm not sure when it would be used, and maybe I'm conflating partitioning and holdouts in my head. I know that for the gw project, we only kept our holdout reaches out of the fine-tuning training; we let them be in the pretraining. And actually, river-dl doesn't currently do anything with the validation and testing partitions of the pretraining data. (That's not a very clear answer :))

Member Author:

That holdout description makes sense to me. You're basically building an emulator of a process model for the study region, and I don't think it's cheating to use all reaches in that step, because a process model can be run anywhere that has the necessary input data. However, it would be important to remove the real observations for holdout reaches when training the process model. Otherwise, some information from those observations would inform the weights and biases of the ML model.

> river-dl doesn't currently do anything with the validation and testing partitions of the pretraining data

It looks like the model training functions could use validation or testing data (e.g., for optimizing hyperparameters during pretraining). Is your statement more about how pretraining has been implemented in workflows?

Anyway, okay, I'll write in an argument that can override the pretraining check.

Member Author:

@janetrbarclay I added a function argument that allows overriding the pretraining check. Let me know if this is good to merge.

        data = data[~data[spatial_idx_name].isin(val_sites)]

    if not group:
        if partition == 'trn':
Member Author:

@janetrbarclay here is the new handling of spatial data holdouts. It's similar to preproc_utils, except this first checks the partition of the data.

Collaborator:

Looks good to me. Thanks for doing this!

@@ -671,14 +677,27 @@ def prep_y_data(
     )

-    # replace validation sites' (and test sites') data with np.nan
+    # replace trn, val and tst sites' data with np.nan
Member Author:

@janetrbarclay here is the new handling of spatial data holdouts in preproc_utils

Collaborator:

Looks good too.

@janetrbarclay (Collaborator) left a comment:

Looks good to me. Thanks for all your work on this! (and your patience with our slow responses)

@jds485 linked an issue on Apr 20, 2023 that may be closed by this pull request.
@jds485 mentioned this pull request on Apr 24, 2023.
@jsadler2 (Collaborator) left a comment:

Thanks for doing this, Jared.

@jds485 (Member Author) commented Apr 26, 2023:

I added 2 final commits based on discussion in #218. I am going to merge this because the commits have been reviewed there.

@jds485 merged commit 3d956e1 into main on Apr 26, 2023.
@jds485 deleted the jds-edits branch on April 26, 2023 at 17:05.
Successfully merging this pull request may close this issue: "seg_id_nat" hardcoded in evaluate.py