Isolate StandardScaler from load_dataset #39

Open

akashshah59 opened this issue Jun 21, 2021 · 1 comment
Assignees: akashshah59
Labels: enhancement (New feature or request)

The current implementation of load_dataset() in data.py instantiates a StandardScaler by default:

def load_dataset(dataset_dir, batch_size, val_batch_size=None, test_batch_size=None):
    if val_batch_size is None:
        val_batch_size = batch_size

    if test_batch_size is None:
        test_batch_size = batch_size

    data = {}

    for category in ["train", "val", "test"]:
        cat_data = np.load(os.path.join(dataset_dir, category + ".npz"))
        data["x_" + category] = cat_data["x"]
        data["y_" + category] = cat_data["y"]

    scaler = StandardScaler(data["x_train"][..., 0])

    for category in ["train", "val", "test"]:
        data["x_" + category][..., 0] = scaler.transform(data["x_" + category][..., 0])
        data["y_" + category][..., 0] = scaler.transform(data["y_" + category][..., 0])

    data_train = PaddedDataset(batch_size, data["x_train"], data["y_train"])
    data["train_loader"] = DataLoader(data_train, batch_size, shuffle=True)

    data_val = PaddedDataset(val_batch_size, data["x_val"], data["y_val"])
    data["val_loader"] = DataLoader(data_val, val_batch_size, shuffle=False)

    data_test = PaddedDataset(test_batch_size, data["x_test"], data["y_test"])
    data["test_loader"] = DataLoader(data_test, test_batch_size, shuffle=False)

    data["scaler"] = scaler
    return data

The goal is to isolate the scaler from the data loading method and eventually support other scalers.
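
One way to get there is to pull the fit-and-transform step into its own function that takes the scaler class as a parameter. The following is a minimal sketch, not the repository's code; the class bodies, the scale_dataset() name, and the scaler_cls argument are assumptions inferred from how StandardScaler is used above.

class StandardScaler:
    """z-score scaler fit on the training data only (assumed interface)."""

    def __init__(self, data):
        self.mean = data.mean()
        self.std = data.std()

    def transform(self, data):
        return (data - self.mean) / self.std

    def inverse_transform(self, data):
        return data * self.std + self.mean


class MinMaxScaler:
    """Example of an alternative scaler the refactor would make possible."""

    def __init__(self, data):
        self.min = data.min()
        self.max = data.max()

    def transform(self, data):
        return (data - self.min) / (self.max - self.min)

    def inverse_transform(self, data):
        return data * (self.max - self.min) + self.min


def scale_dataset(data, scaler_cls=StandardScaler):
    """Fit a scaler on x_train and apply it to every split in place.

    `data` is the dict built by load_dataset() before any scaling; moving
    this step out of load_dataset() is the isolation this issue asks for.
    """
    scaler = scaler_cls(data["x_train"][..., 0])
    for category in ["train", "val", "test"]:
        data["x_" + category][..., 0] = scaler.transform(data["x_" + category][..., 0])
        data["y_" + category][..., 0] = scaler.transform(data["y_" + category][..., 0])
    data["scaler"] = scaler
    return data

load_dataset() would then only read the .npz files and build the loaders, and a caller could pass MinMaxScaler (or any object with the same transform/inverse_transform interface) without touching the loading code.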

akashshah59 added the enhancement label on Jun 21, 2021
akashshah59 self-assigned this on Jun 21, 2021
akashshah59 (Collaborator, Author) commented Jun 22, 2021

@klane and @yuqirose Shouldn't the scaler be a part of data preprocessing and not part of the model?

Currently, our forward step applies a scaler; however, does it make sense to have it defined by the user outside the model's step function instead?

    def _step(self, batch, batch_idx, num_batches):
        x, y = self.prepare_batch(batch)

        if self.training:
            batches_seen = batch_idx + self.current_epoch * num_batches
        else:
            batches_seen = batch_idx

        pred = self(x, y, batches_seen)

        if self.scaler is not None:
            y = self.scaler.inverse_transform(y)
            pred = self.scaler.inverse_transform(pred)

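If scaling is treated purely as data preprocessing, the inverse transform could also move out of the model. Below is a minimal sketch under that assumption; the helper name mae_in_original_units() is made up for illustration, and it assumes the scaler returned by load_dataset() is available wherever the loss or metrics are computed.

import torch


def mae_in_original_units(pred, y, scaler):
    # Undo the scaling outside the model, so _step() no longer needs
    # self.scaler or any knowledge of how the inputs were normalized.
    pred = scaler.inverse_transform(pred)
    y = scaler.inverse_transform(y)
    return torch.mean(torch.abs(pred - y))

The trade-off is that every consumer of the model's raw outputs (metrics, plots, evaluation scripts) then has to remember to call inverse_transform, which is the main argument for keeping it inside _step().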