This repository has been archived by the owner on Jun 2, 2023. It is now read-only.

Multiple train/test splits result in discontinuous batches #127

Open · SimonTopp opened this issue Aug 17, 2021 · 5 comments

@SimonTopp (Contributor) commented Aug 17, 2021

```python
# data_array has time on axis 1; offset staggers the start of each pass
# (e.g., offset=0.5 -> two passes of seq_len-long splits, shifted by seq_len/2)
for i in range(int(1 / offset)):
    start = int(i * offset * seq_len)
    idx = np.arange(start=start, stop=data_array.shape[1] + 1, step=seq_len)
    split = np.split(data_array, indices_or_sections=idx, axis=1)
    # add all but the first and last pieces, since they will be shorter than seq_len
    combined.extend([s for s in split if s.shape[1] == seq_len])
```

Here, if we have discontinuous training and testing groups (i.e., two separate sets of date ranges for each), and the batch length is set to anything other than 365, then I think this results in one batch that starts in the first date range and ends in the second. I think we should first group by water year, then split into batches, and just pad and/or drop the last one. What do you all think?
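A minimal sketch of that idea, assuming a hypothetical `split_into_batches` helper and toy arrays (this is not the existing river-dl implementation): split each contiguous date range on its own and pad the final short batch with NaN, so no batch ever spans two ranges.

```python
import numpy as np

def split_into_batches(group, seq_len):
    """Split one contiguous block (time on axis 1) into seq_len-long batches,
    padding the final batch with NaN instead of letting it spill into the
    next date range. Hypothetical helper, not the existing river-dl code."""
    n_time = group.shape[1]
    n_batches = int(np.ceil(n_time / seq_len))
    pad = n_batches * seq_len - n_time
    if pad:
        pad_shape = (group.shape[0], pad) + group.shape[2:]
        group = np.concatenate([group, np.full(pad_shape, np.nan)], axis=1)
    return np.split(group, n_batches, axis=1)

# Toy example: two contiguous date ranges of different lengths (seg x time)
groups = [np.random.rand(4, 365), np.random.rand(4, 500)]
batches = []
for g in groups:                       # batches never cross between groups
    batches.extend(split_into_batches(g, seq_len=200))
```

Dropping the padded tail instead of padding it would be the other option mentioned above.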

@jsadler2 (Collaborator)

Interesting. It's been a while since I wrote this (or thought about this ... or used this 😄). Have you confirmed that this is what happens?

@jdiaz4302 (Collaborator) commented Nov 4, 2021

This may be of interest as confirmation that multiple train/test splits result in discontinuous sequences.

[figure: discontinuous_sequence_dates]
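As a quick way to reproduce that check, here is a sketch assuming a hypothetical `date_batches` list (1-D date arrays built alongside the data batches) and a daily time step:

```python
import numpy as np

def find_discontinuous_batches(date_batches):
    """Return indices of batches whose dates are not consecutive calendar days.
    `date_batches`: list of 1-D datetime64[D] arrays, one per batch (assumed)."""
    one_day = np.timedelta64(1, "D")
    return [i for i, dates in enumerate(date_batches)
            if np.any(np.diff(dates) != one_day)]

# Toy check: the second "batch" jumps from one date range into the next
a = np.arange("2000-01-01", "2000-01-11", dtype="datetime64[D]")
b = np.concatenate([np.arange("2000-09-26", "2000-10-01", dtype="datetime64[D]"),
                    np.arange("2001-10-01", "2001-10-06", dtype="datetime64[D]")])
print(find_discontinuous_batches([a, b]))  # -> [1]
```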

@janetrbarclay (Collaborator)

Further confirmation if you look at the temperatures in a single sample (these are observed temperatures; the number in the figure title is the seg_id):

[figure: observed temperatures in a single sample]

@jdiaz4302 (Collaborator)

Using the existing reduce_training_data_continuous function from river_dl/preproc_utils.py can help produce continuous batches with NaN values. For example, here is the 365-day sequence of pretraining and finetuning Ys when I applied it to only the finetuning Y (the gap in the finetuning Y, in summer, is where the NaNs were placed):

[figure: continuous_batch — pretraining and finetuning Y over one 365-day sequence]

If you apply reduce_training_data_continuous to the X variables, you end up with NaNs in the predictions and, subsequently, in the loss function. Taking this approach in #142 by applying reduce_training_data_continuous to only the Y array (and not the pretraining Y or X arrays) led to much worse RMSE (roughly a factor of 2). I assume this is because the model is exposed to out-of-period X values whose corresponding Y values are set to NaN, yet those X values are still part of the 365-day sequence alongside everything else, which may lead to some misleading learning.
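For illustration, the masking idea might look roughly like the sketch below; the helper name, arguments, and example windows are assumptions for this sketch, not reduce_training_data_continuous's actual signature, and it only helps if the loss function ignores NaN targets.

```python
import numpy as np
import pandas as pd

def mask_y_outside_windows(y, dates, windows):
    """Keep every time step (so batches stay continuous) but set y to NaN
    wherever `dates` falls outside the (start, end) training windows.
    Hypothetical illustration of the masking idea, not river-dl's API.
    y: array with time on axis 1; dates: 1-D array aligned with that axis."""
    dates = pd.to_datetime(dates)
    keep = np.zeros(len(dates), dtype=bool)
    for start, end in windows:
        keep |= (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    y_masked = y.astype(float).copy()
    y_masked[:, ~keep] = np.nan
    return y_masked

# Toy example: 3 segments x 4 water years of daily data,
# training on the first and last water years only (example dates)
dates = pd.date_range("1984-10-01", "1988-09-30", freq="D").values
y_obs = np.random.rand(3, len(dates))
y_trn = mask_y_outside_windows(
    y_obs, dates,
    [("1984-10-01", "1985-09-30"), ("1987-10-01", "1988-09-30")],
)
```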

@jds485 (Member) commented May 23, 2023

I think this issue has been addressed by #218.
