skorch.fit can't handle lists of lists with variable length #605

Closed
econti opened this issue Mar 6, 2020 · 39 comments

@econti

econti commented Mar 6, 2020

I'm having a hard time figuring out how to pass a list of lists (with variable length) to skorch's fit method.

Specifically, I have a feature that is a list of IDs (e.g. [[1, 12, 3], [6, 22]...]) which are converted to a dense representation using an embedding table in my PyTorch module's forward method:

def forward(self, X_float, X_id_list):
    ...

When I call net.fit() on my data set (e.g. {"X_float": ..., "X_id_list": ...}), I get the following error caused by the list of lists:

ValueError: Dataset does not have consistent lengths.

I've also tried converting the list of lists to a pandas dataframe and numpy array (of objects) and neither works. How do you handle variable length lists of lists in skorch.fit?

@cgarciae

cgarciae commented Mar 7, 2020

I don't know the specifics of this in skorch, but generally you need to add padding or perform slicing so every sample has the same length. The only exception to this is TensorFlow's Ragged Tensors, but even then you have to specify a default value to pad with when converting to regular tensors (PyTorch doesn't have Ragged Tensors yet).

@BenjaminBossan
Collaborator

@econti Could you check whether PackedSequence solves your issue?

Otherwise, we have an example here that shows how to potentially deal with variable length sequences.
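For readers unfamiliar with PackedSequence, here is a minimal sketch of how padding and packing fit together in plain PyTorch. The ID lists, the vocabulary size of 32, and the layer sizes are illustrative assumptions, not taken from the original poster's model.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import (
    pad_sequence,
    pack_padded_sequence,
    pad_packed_sequence,
)

# Variable-length lists of IDs, as in the original question.
id_lists = [[1, 12, 3], [6, 22], [7]]
lengths = torch.tensor([len(ids) for ids in id_lists])

# Pad to a rectangular (batch, max_len) tensor; index 0 is reserved
# for padding in this sketch (an assumption, pick what fits your vocab).
padded = pad_sequence(
    [torch.tensor(ids) for ids in id_lists],
    batch_first=True,
    padding_value=0,
)

embedding = nn.Embedding(num_embeddings=32, embedding_dim=4, padding_idx=0)
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)

# Pack so that the LSTM skips the padded positions.
packed = pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack again if per-step outputs are needed.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(padded.shape, output.shape, out_lengths.tolist())
```

With packing, h_n holds the hidden state at each sequence's true last step, which is usually what a classifier head should consume instead of the output at the padded final position.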

@econti
Author

econti commented Mar 9, 2020

Thanks @BenjaminBossan, that did the trick for me. Leaving a code snippet here for anyone else who encounters a similar issue:

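import torch
from torch.nn.utils.rnn import pad_sequence

# data["X_id_list"] is a pandas dataframe whose columns hold variable-length lists, e.g.
# [[1, 3], [0, 40, 16], ...]

X_id_list = {}

# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
for series_name, series in data["X_id_list"].items():
    # one 1-D tensor per row, then pad all rows to the longest one
    pre_pad = [torch.tensor(i) for i in series]
    X_id_list[series_name] = pad_sequence(
        pre_pad, batch_first=True, padding_value=0
    )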

@econti econti closed this as completed Mar 9, 2020
@BenjaminBossan
Collaborator

@econti Great that you found a solution and thanks for the snippet.

@ToddMorrill

I'm facing a similar issue right now, and I suspect I'm doing the same thing you are: padding to the longest sequence length in the dataset, which results in significantly more computation than padding at the batch level. I suspect we need something like a collate_fn that operates at the batch level to solve this the right way.
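To make the cost difference concrete, here is a small back-of-the-envelope sketch in plain Python. The sequence lengths and the batch size are made-up numbers chosen so that a single long outlier dominates; real datasets will show a different ratio.

```python
# Hypothetical sequence lengths: mostly short, with one long outlier.
lengths = [5, 7, 6, 8, 5, 6, 7, 200]
batch_size = 4


def padded_tokens_dataset_level(lengths):
    # Every sequence is padded to the longest sequence in the whole dataset.
    return max(lengths) * len(lengths)


def padded_tokens_batch_level(lengths, batch_size):
    # Each batch is padded only to its own longest sequence.
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


print(padded_tokens_dataset_level(lengths))            # 1600
print(padded_tokens_batch_level(lengths, batch_size))  # 832
```

Only the batch that contains the outlier pays for it; the other batch is padded to length 8 instead of 200.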

@BenjaminBossan
Collaborator

@ToddMorrill I don't know the exact details of your case, so maybe I'm missing something. In general though, collate_fn is designed to work on samples and not on batches. If you want to avoid any costly operation on each sample, you would have to provide your own DataLoader. You can pass it as iterator_train and iterator_valid to NeuralNet in skorch.

However, this is not the canonical way of dealing with sequences of different lengths. Maybe you can make use of PackedSequence or pad_sequence.

@ToddMorrill

"A custom collate_fn can be used to customize collation, e.g., padding sequential data to max length of a batch." (Source: the PyTorch data loading docs.) That's what I'm trying to do.

Thanks for pointing me toward NeuralNet. I was using NeuralNetClassifier and totally missed the opportunity to use a custom DataLoader. I'll give that a shot.

I'm not opposed to using pad_sequence, it's just that I got started with torchtext and it was already doing a fantastic job taking care of all my text preprocessing needs, including padding, so I didn't want to rewrite that functionality.

@ToddMorrill

To be sure, I'm trying to reuse the following torchtext code with skorch.

import torchtext
from torchtext import data
from torchtext import datasets

# set up fields
TEXT = data.Field(lower=True, batch_first=True, )
LABEL = data.Field(sequential=False, unk_token=None)

# takes approx. 10 minutes to download data and embeddings (will be cached for re-use)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# will be used to initialize model embeddings layer
vocab = torchtext.vocab.GloVe(name='6B', dim=100)

# build the vocabulary
max_size = 25_000 # shorten for demonstrative purposes
TEXT.build_vocab(train, vectors=vocab, max_size=max_size)
LABEL.build_vocab(train)

# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits((train, test), batch_sizes=(32, 64), device='cpu')

So far, I haven't found a way to reuse train_iter with skorch. train_iter is used in a for loop and yields batches of data padded to the longest sequence in the batch. It also buckets batches by sequence length to reduce computation. Each batch has a .text and a .label attribute that contain the numericalized data and label representation, respectively.

I welcome any suggestions on recycling this code.
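The effect of bucketing can be estimated without torchtext. The sketch below is a simplification under stated assumptions: it sorts all sequences by length before slicing them into batches, whereas a real bucket iterator typically also shuffles within and across buckets. The lengths are randomly generated placeholders, not IMDB statistics.

```python
import random


def make_batches(lengths, batch_size):
    # Slice the list of lengths into consecutive batches.
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padded_size(batches):
    # Tokens processed when each batch is padded to its own longest sequence.
    return sum(max(batch) * len(batch) for batch in batches)


random.seed(0)
# Hypothetical sequence lengths in tokens.
lengths = [random.randint(10, 500) for _ in range(1000)]

unsorted_batches = make_batches(lengths, batch_size=32)
bucketed_batches = make_batches(sorted(lengths), batch_size=32)

real_tokens = sum(lengths)
print("real tokens:          ", real_tokens)
print("padded, random order: ", padded_size(unsorted_batches))
print("padded, sorted order: ", padded_size(bucketed_batches))
```

When sequences of similar length share a batch, the padded total stays close to the number of real tokens, which is where the extra speed-up over plain per-batch padding comes from.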

@ToddMorrill

My apologies for all the posts but I just wanted to share a quick update before signing off and ask a question.

I created a custom dataset and then implemented a custom collate_fn as follows:

def pad_batch(batch):
    text, label = list(zip(*batch))
    padded_batch = pad_sequence(text, batch_first=True, padding_value=1)
    return padded_batch, torch.cat(label)

skorch_model = NeuralNet(
                CNN,
                device=device,
                max_epochs=2,
                lr=0.001,
                optimizer=optim.Adam,
                criterion=nn.NLLLoss,
                iterator_train__collate_fn=pad_batch,
                iterator_train__shuffle=True,
                iterator_valid__collate_fn=pad_batch,
                iterator_valid__shuffle=False,
                train_split=skorch.dataset.CVSplit(.2), # NB: this withholds 20% of the training data for validation
                module__n_filters=100,
                module__filter_sizes=(2,3,4),
                module__dropout=0.2,
                module__pretrained_embeddings=TEXT.vocab.vectors,
                batch_size=32,
                verbose=2)

skorch_model.fit(train_dataset)

What's amazing about padding at the batch level is that run times went from 60 seconds per epoch to 20 seconds per epoch, a huge improvement. However, I liked all of the functionality I had while using NeuralNetClassifier, namely all of the scoring functions. NeuralNetClassifier insists on having skorch_model.fit(X, y) and fails with skorch_model.fit(train_dataset). Do you have a way around this so that I can use NeuralNetClassifier with my custom dataset and custom dataloader?

I'm still interested in recycling the torchtext functionality so if you have thoughts on that, I still welcome them!!

Thanks for all of your help! I'm loving skorch.

@BenjaminBossan
Collaborator

Thanks for all of your help! I'm loving skorch.

That's great to hear, thanks.

Thanks for pointing me toward NeuralNet. I was using NeuralNetClassifier and totally missed the opportunity to use a custom DataLoader. I'll give that a shot.

Sorry that I have confused you, you can do the same thing with NeuralNetClassifier, I just used NeuralNet as a stand in for all the derived classes.

NeuralNetClassifier insists on having skorch_model.fit(X, y) and fails with skorch_model.fit(train_dataset)

It depends a bit. What does your target look like? Potentially, it could be possible to extract it and pass it as y. But that only really makes sense if you work on a (multiclass) classification problem -- is that the case for your dataset? If you want to do, say, seq2seq, I don't see how that can work with NeuralNetClassifier.

namely all of the scoring functions

Note that you can use the scoring functions also with NeuralNet, have a look at EpochScoring.

@ToddMorrill

I'm making progress on my example text classification pipeline using NeuralNetClassifier. Have a look here. I managed to recycle the useful parts of torchtext (e.g. TEXT.process(batch), etc.) but did indeed have to use a custom collate_fn inside of DataLoader. Most importantly to me, run times have been reduced dramatically. I think there is potential to speed things up further if we can make use of a bucket iterator like the one in torchtext. I'll bet torch.nn.utils.rnn.pack_padded_sequence would be helpful here, as you pointed out @BenjaminBossan, but it just requires me to implement more functionality. The bottom line is I was hoping to make use of torchtext's functionality from start to finish. That does not appear to be possible with skorch at this stage. If there is anything I can do to help make this possible, please let me know.

@BenjaminBossan
Collaborator

I believe it makes a lot of sense to make skorch work with popular libraries like torchtext and torchvision. When we released skorch, the former didn't exist yet, so now we might be in a place where not everything works together. However, there might still be a way. I would need to look more thoroughly at what torchtext provides and see what we can do, once I have a bit of time.

@ToddMorrill please keep us up-to-date if you find some better solution.

@kqf
Contributor

kqf commented May 31, 2020

Hi guys,
Sorry for the noise if this is no longer relevant, but I wasn't able to find any examples of skorch + torchtext, and this is the only thread that comes up on Google.

@ToddMorrill

I think there is potential to speed things up further if we can make use of a bucket iterator like the one in torchtext.

I have good news for you :D skorch supports pytorch datasets, the same convention is followed by torchtext. In fact, all their datasets are inherited from torch.utils.data.Dataset. In theory, this makes them compatible with skorch.
To me, it's a beautiful example of great design and implementation. Both teams followed the same conventions imposed by pytorch and ended up with two independent libraries that are compatible with each other.

Here is a short example (somewhat similar to the one provided by @ToddMorrill) showing how to integrate torchtext into a skorch pipeline:

import torch
import skorch
import random
import numpy as np
import pandas as pd
from torchtext.data import BucketIterator, Example, Dataset, Field, LabelField
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

SEED = 137

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


def data(size=1000):
    return pd.DataFrame({
        "query": ["This is a duck", "This is a goose"] * size,
        "target": [0, 1] * size,
    })


class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, fields, need_vocab=None):
        self.fields = fields
        self.need_vocab = need_vocab or {}

    def fit(self, X, y=None):
        dataset = self.transform(X, y)
        for field, min_freq in self.need_vocab.items():
            field.build_vocab(dataset, min_freq=min_freq)
        return self

    def transform(self, X, y=None):
        proc = [X[col].apply(f.preprocess) for col, f in self.fields]
        examples = [Example.fromlist(f, self.fields) for f in zip(*proc)]
        return Dataset(examples, self.fields)


def build_preprocessor():
    text_field = Field(lower=True)
    label_field = LabelField(is_target=True)
    fields = [
        ('query', text_field),
        ('target', label_field),
    ]
    return TextPreprocessor(fields, need_vocab={text_field: 0, label_field: 0})


class SimpleModule(torch.nn.Module):
    def __init__(self, vocab_size=100, emb_dim=16, lstm_hidden_dim=32):
        super().__init__()
        self._emb = torch.nn.Embedding(vocab_size, emb_dim)
        self._rnn = torch.nn.LSTM(emb_dim, lstm_hidden_dim)
        self._out = torch.nn.Linear(lstm_hidden_dim, 2)

    def forward(self, inputs):
        rnn_output = self._rnn(self._emb(inputs))[0]
        # pass dim explicitly; relying on the implicit dim is deprecated
        return torch.nn.functional.softmax(self._out(rnn_output[-1]), dim=-1)


class InputShapeSetter(skorch.callbacks.Callback):
    def on_train_begin(self, net, X, y):
        # NB: If your module relies on pretrained embeddings
        # net.set_params(module__embeddings=X.fields["query"].vocab.vectors)
        pass


def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        iterator_train=BucketIterator,
        iterator_valid=BucketIterator,
        train_split=Dataset.split,
        callbacks=[InputShapeSetter()],
    )
    full = make_pipeline(
        build_preprocessor(),
        model
    )
    return full


def main():
    df = data()
    assert type(df) == pd.DataFrame

    dataset = build_preprocessor().fit_transform(df)
    assert type(dataset) == Dataset

    # Putting it all together
    model = build_model().fit(
        df,  # pd.DataFrame, torchtext handles X and y
        0.7  # <<< ?? This sets split_ratio for Dataset.split
    )
    print(model.predict(df))
    assert model.score(df, df["target"]) > 0.5, "Fitting issues"


if __name__ == '__main__':
    main()

This code should work with the latest versions of the libraries. The only strange thing is that you have to pass split_ratio=0.7 through .fit method. I guess, this side effect is caused by this line in the skorch code. Perhaps, there's a better solution for this.

@BenjaminBossan It looks like you are a member of the dev team. #594 is probably related to this torchtext topic. If you raise an error on IterableDataset, you will lose this torchtext support. I might be wrong.

Once again sorry for spamming.

@BenjaminBossan
Collaborator

BenjaminBossan commented May 31, 2020

@kqf Thanks for posting the example, I'm taking a look at it. At the end of the day, I think it would be nice to add a notebook that showcases how to use torchtext. Ideally, it should use one of the torchtext datasets like IMDB and pretrained embeddings.

The only strange thing is that you have to pass split_ratio=0.7 through .fit method

Yes, that works, but it's a bit of a hacky solution. This solution here should be clearer:

from functools import partial

def my_train_split(dataset, y, split_ratio):
    return dataset.split(split_ratio=split_ratio)

...

def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        iterator_train=BucketIterator,
        iterator_valid=BucketIterator,
        train_split=partial(my_train_split, split_ratio=0.7),
        callbacks=[InputShapeSetter()],
    )
    ...

model = build_model().fit(df)  # no need to pass split_ratio here
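The following plain-Python sketch illustrates why both variants work. It assumes, as the discussion above suggests, that skorch calls train_split(dataset, y); ToyDataset is a hypothetical stand-in for a torchtext Dataset and only mimics its split(split_ratio=...) method.

```python
from functools import partial


class ToyDataset:
    """Hypothetical stand-in for a torchtext Dataset with a split() method."""

    def __init__(self, examples):
        self.examples = examples

    def split(self, split_ratio=0.7):
        cut = round(len(self.examples) * split_ratio)
        return ToyDataset(self.examples[:cut]), ToyDataset(self.examples[cut:])


def my_train_split(dataset, y, split_ratio):
    # y is accepted because the caller passes it, but it is not needed here.
    return dataset.split(split_ratio=split_ratio)


dataset = ToyDataset(list(range(10)))

# Unbound method as train_split: the second positional argument (normally y)
# lands in split_ratio, which is why fit(df, 0.7) happened to work.
train, valid = ToyDataset.split(dataset, 0.7)

# With partial, the ratio is fixed up front and y can simply be None.
train_split = partial(my_train_split, split_ratio=0.7)
train2, valid2 = train_split(dataset, None)

print(len(train.examples), len(valid.examples))    # 7 3
print(len(train2.examples), len(valid2.examples))  # 7 3
```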

@kqf
Contributor

kqf commented Jun 1, 2020

@BenjaminBossan

it should use one of the torchtext datasets like IMDB and pretrained embeddings.

It's totally doable, I didn't want to download the data/embeddings on my private laptop.

Yes, that works, but it's a bit of a hacky solution. This solution here should be clearer:

Yes, I agree, but that was one of my intentions: to demonstrate that skorch is compatible with torchtext without extra code and to show a strange skorch behaviour. I would expect that if I pass .fit(X, y=None), then y will not be passed to the split function. I think it should be handled on the skorch side, and it deserves an issue of its own 🤷

What do you think?

@BenjaminBossan
Collaborator

It's totally doable, I didn't want to download the data/embeddings on my private laptop.

Yes, what you posted is a really good starting point.

without extra code

I think those two lines are acceptable :)

I would expect that if I pass .fit(X, y=None), then y will not be passed to the split function.

I think that could make sense. Do you want to work on this change?

In the meantime, I tried to implement a torchtext example with skorch that's a bit closer to a real world problem someone could have. It uses skorch with torchtext and BERT (via huggingface). Here is the notebook:

https://nbviewer.jupyter.org/github/BenjaminBossan/playground/blob/master/skorch_torchtext_bert.ipynb

@kqf @ToddMorrill since you know torchtext much better than I do, could you check if what I did makes sense? E.g., I don't really understand what all this Field, TEXT, LABEL, and build_vocab stuff does. For reference, my notebook is basically a re-implementation of this notebook.

The main change that I had to introduce was to slightly change BucketIterator:

class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return batch.text and batch.label, corresponding to X and y
            yield batch.text, batch.label.long()

skorch basically really wants to always have an X and a y, because this is what sklearn expects. With the shown change, we get that. (I didn't quite get why batch.label is int32, surely there is a better way to change that.) Apart from this, I could re-use most of the code from the original notebook.

ping @ottonemo maybe this is also interesting for you.

@BenjaminBossan BenjaminBossan reopened this Jun 1, 2020
@kqf
Contributor

kqf commented Jun 1, 2020

I think that could make sense. Do you want to work on this change?

Yes, I'd love to help, but I will have time only on weekends. If it's ok -- I am in.

since you know torchtext much better than I do, could you check if what I did makes sense? E.g., I don't really understand what all this Field, TEXT, LABEL, and build_vocab stuff does. For reference, my notebook is basically a re-implementation of this notebook.

I am not an expert in torchtext either, but your code looks fine. TEXT and LABEL are instances of the Field class. Fields are "applied" to examples to extract the information needed. The fields define all necessary transformations, and build_vocab is similar to the .fit method of transformers (so you have to apply it to the train data only).
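The fit/transform analogy can be sketched with a toy class. ToyVocabField below is a hypothetical, heavily simplified stand-in and not the torchtext implementation; the special tokens and their indices are assumptions made for the example.

```python
from collections import Counter


class ToyVocabField:
    """Hypothetical, much simplified stand-in for a torchtext Field."""

    def __init__(self, lower=True):
        self.lower = lower
        self.stoi = None  # string -> index, filled in by build_vocab

    def preprocess(self, text):
        text = text.lower() if self.lower else text
        return text.split()

    def build_vocab(self, texts, min_freq=1):
        # Like fit(): learn the vocabulary from the training data only.
        counts = Counter(tok for text in texts for tok in self.preprocess(text))
        tokens = sorted(tok for tok, n in counts.items() if n >= min_freq)
        self.stoi = {"<unk>": 0, "<pad>": 1}
        for tok in tokens:
            self.stoi[tok] = len(self.stoi)

    def numericalize(self, text):
        # Like transform(): tokens unseen during build_vocab map to <unk>.
        return [self.stoi.get(tok, 0) for tok in self.preprocess(text)]


field = ToyVocabField()
field.build_vocab(["This is a duck", "This is a goose"])
print(field.numericalize("This is a swan"))  # [6, 5, 2, 0]
```

Building the vocabulary on validation or test data as well would leak information, for the same reason a scaler should not be fitted on test data.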

I like the way you are handling torchtext.data.Batch. It's really a good one.

skorch basically really wants to always have an X and a y, because this is what sklearn expects

I think what you are saying here is important. The default NeuralNet was designed to be a supervised model. Today, there are more and more unsupervised and semi-supervised DL applications, so maybe it would make sense to add an UnsupervisedNeuralNet or something like this. I think this would still be compatible with sklearn, as they have support for clustering and manifold learning.

@kqf
Contributor

kqf commented Jun 1, 2020

@BenjaminBossan One more thing about examples with torchtext and it is directly related to the issue. Today I was trying to use skorch together with torchtext for metric learning. For this problem, you have to pass two fields to the forward method, and y should remain empty. I will not provide the full example here, as it may be a bit lengthy, but, probably it will be useful to have a notebook that shows how to achieve that?

In any event, if you have to pass multiple fields to the forward method, you have to make two modifications:

  1. Edit the bucket iterator (similarly to the BERT example):
from operator import attrgetter


def batch2dict(batch):
    return {f: attrgetter(f)(batch) for f in batch.input_fields}


class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return dict() and empty tensor, corresponding to X and y
            yield batch2dict(batch), torch.empty(0)
  2. You have to use the Field(batch_first=True) option when creating the fields, otherwise skorch will complain about the inconsistent length of the dataset

So, this should demonstrate how to use multiple fields with skorch, hope someone will find it useful.
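A minimal sketch of why yielding a dict works, using plain Python stand-ins instead of real torchtext or skorch objects: when X is a dict, skorch passes its keys as keyword arguments to the module's forward, so the field names must match the parameter names of forward. The field names query and reply and the toy forward function are assumptions made for illustration.

```python
from operator import attrgetter
from types import SimpleNamespace


def batch2dict(batch):
    # Same helper as above: one dict entry per input field of the batch.
    return {f: attrgetter(f)(batch) for f in batch.input_fields}


def forward(query, reply):
    # Hypothetical stand-in for module.forward with two inputs.
    return [len(q) + len(r) for q, r in zip(query, reply)]


# Hypothetical stand-in for a torchtext Batch with two input fields.
batch = SimpleNamespace(
    input_fields=["query", "reply"],
    query=[[4, 8, 15], [16, 23, 42]],
    reply=[[1, 2], [3, 4]],
)

X = batch2dict(batch)
# Equivalent to calling forward(query=..., reply=...).
output = forward(**X)
print(sorted(X))  # ['query', 'reply']
print(output)     # [5, 5]
```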

@BenjaminBossan
Collaborator

Yes, I'd love to help, but I will have time only on weekends. If it's ok -- I am in.

No problem at all. If you need help along the way, just ask.

The default NeuralNet was designed to be a supervised model. Today, there are more and more unsupervised and semi-supervised DL applications, so maybe it would make sense to add an UnsupervisedNeuralNet or something like this. I think this would still be compatible with sklearn, as they have support for clustering and manifold learning.

NeuralNetClassifier, NeuralNetBinaryClassifier, and NeuralNetRegressor are explicitly modeled to be for supervised learning. NeuralNet is more open-ended and should be used for anything unsupervised. As with sklearn's unsupervised models, we support calling fit(X) without passing y there.

So, this should demonstrate how to use multiple fields with skorch, hope someone will find it useful.

Thanks for providing the example.

Today I was trying to use skorch together with torchtext for metric learning. For this problem, you have to pass two fields to the forward method, and y should remain empty.

I'm curious what exactly you are doing there. I implemented some metric learning approaches in the past, typically using something like a Siamese net. You could use the target to indicate which samples belong together. I moved the main logic for the metric learning to the criterion, so that the module was just returning the representations. But that might not fit your use case. And if you want to add goodies like triplet mining, it can become complicated fast (see discussion here).

@kqf
Contributor

kqf commented Jun 2, 2020

I'm curious what exactly you are doing there.

If you ask about the application, it's a chatbot (there is a database with replies, so the model needs to find the most relevant one when supplied with the user query). And it's very similar to the example you mentioned. In my case, it's somewhat easier as I do hard and semi-hard negatives mining within a batch. I decided to separate the logic: I have a separate encoder-towers module and a loss module that mines hard-negatives and calculates the triplet loss. I didn't know about skorch.toy, thanks.
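For illustration only, in-batch hard negative mining with a triplet-style loss could look roughly like the sketch below. This is not the code from the chatbot project: the function name, the cosine-similarity formulation, and the margin value are all assumptions, and it mines only the single hardest negative per query (no semi-hard negatives).

```python
import torch
import torch.nn.functional as F


def in_batch_hard_triplet_loss(queries, replies, margin=0.2):
    """queries[i] is assumed to match replies[i]; every other reply
    in the batch serves as a negative for query i."""
    q = F.normalize(queries, dim=1)
    r = F.normalize(replies, dim=1)
    sim = q @ r.t()          # (batch, batch) cosine similarities
    positive = sim.diag()    # similarity to the matching reply

    # Hide the positives, then take the most similar (hardest) negative.
    is_positive = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_negative = sim.masked_fill(is_positive, float("-inf")).max(dim=1).values

    # Hinge: the positive should beat the hardest negative by the margin.
    return torch.relu(margin - positive + hardest_negative).mean()


torch.manual_seed(0)
queries = torch.randn(4, 16)   # stand-in for encoded user queries
replies = torch.randn(4, 16)   # stand-in for encoded replies
loss = in_batch_hard_triplet_loss(queries, replies)
print(loss.item())
```

Keeping the mining inside the loss module, as described above, means the encoder module only has to return the two representations.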

You could use the target

What do you mean by target? Is it Field(is_target=True)?

@BenjaminBossan
Collaborator

You could use the target

I just meant that the y could contain, for instance, the clusters that your samples belong to, so that you could use it for mining. But it seems you already found a solution that works for you 👍

@ToddMorrill

class SkorchBucketIterator(BucketIterator):
    def __iter__(self):
        for batch in super().__iter__():
            # We make a small modification: Instead of just returning batch
            # we return batch.text and batch.label, corresponding to X and y
            yield batch.text, batch.label.long()

This is working really well. My epoch times were pretty much cut in half with this modification. Thank you for your example @BenjaminBossan.

I just tried to plug this into a grid search like the following and got an error. I'm including the traceback for reference. I can try to look into the error but I'm not too familiar with sklearn's internals. Is there a way forward here?

search = RandomizedSearchCV(skorch_model, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)
search.fit(X=dev_dataset, y=None)

Traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-fa3b744e6e6b> in <module>
      1 search = RandomizedSearchCV(skorch_model, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)
----> 2 search.fit(X=dev_dataset, y=None)

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    648             refit_metric = 'score'
    649 
--> 650         X, y, groups = indexable(X, y, groups)
    651         fit_params = _check_fit_params(X, fit_params)
    652 

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    246     """
    247     result = [_make_indexable(X) for X in iterables]
--> 248     check_consistent_length(*result)
    249     return result
    250 

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    206     """
    207 
--> 208     lengths = [_num_samples(X) for X in arrays if X is not None]
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in <listcomp>(.0)
    206     """
    207 
--> 208     lengths = [_num_samples(X) for X in arrays if X is not None]
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:

~/Documents/regtech/.venv/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
    148 
    149     if hasattr(x, 'shape') and x.shape is not None:
--> 150         if len(x.shape) == 0:
    151             raise TypeError("Singleton array %r cannot be considered"
    152                             " a valid collection." % x)

TypeError: object of type 'generator' has no len()

@BenjaminBossan
Collaborator

@ToddMorrill Thanks for reporting.

It's not quite easy for me to deduce what's going on. Could you either provide me a minimal code sample to reproduce the error or check the following things for me (by using a debugger):

  1. could you please check if the code runs when you wrap your dataset using skorch's SliceDataset?
  2. what is the type of x and x.shape in the last step?
  3. In this line:
--> 650         X, y, groups = indexable(X, y, groups)

what are the types of X and y?

  4. Here: search.fit(X=dev_dataset, y=None), what is the type of dev_dataset?
  5. When fitting without RandomizedSearchCV, everything runs fine?

@ToddMorrill

I was able to reproduce it with your example by adding the following lines to the bottom of the script.

params = {'module__hidden_dim': [128, 256],
          'module__n_layers': [1, 2],
          'module__bidirectional': [False, True],
          'module__dropout': [0.2, 0.25]}

from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(net, params, n_iter=2, verbose=2, refit=False, scoring='accuracy', cv=5)

# we can set y=None because the labels are contained inside the dataset
search.fit(ds_train, y=None)

could you please check if the code runs when you wrap your dataset using skorch's SliceDataset?

X_sl = SliceDataset(ds_train)
search.fit(X_sl, y=None)

Running this results in the following output. No errors but it didn't train.

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.2, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[CV] module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True 
[CV]  module__n_layers=2, module__hidden_dim=128, module__dropout=0.25, module__bidirectional=True, total=   0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.2s finished

what is the type of x and x.shape in the last step?

From the debugger:

type(x) == torchtext.datasets.imdb.IMDB
type(x.shape) == generator

I believe x is just type(ds_train) == torchtext.datasets.imdb.IMDB. x.shape (i.e. ds_train.shape) results in a generator. I was able to reproduce the error with the following code.

len(ds_train.shape)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-e50229cbe695> in <module>
----> 1 len(ds_train.shape)

TypeError: object of type 'generator' has no len()

In this line:

--> 650         X, y, groups = indexable(X, y, groups)

what are the types of X and y?

From the debugger:

type(X) == torchtext.datasets.imdb.IMDB
type(y) == None

Here: search.fit(X=dev_dataset, y=None), what is the type of dev_dataset?

type(dev_dataset) == torchtext.data.dataset.Dataset

When fitting without RandomizedSearchCV, everything runs fine?

Yes, it's fantastic!

@BenjaminBossan
Collaborator

Thanks for investigating @ToddMorrill

I tracked down the weird generator error and this is the cause:

https://github.com/pytorch/text/blob/c57369cb1049b4ecb075f6f766494ed3842269d1/torchtext/data/dataset.py#L151-L154

In my opinion, this is a bug in the torchtext library, since accessing any unknown attribute on a dataset will return an empty generator. If the attribute is not known, they should definitely raise an AttributeError (as prescribed by the Python docs). However, for that to happen, __getattr__ should not be a generator.

Basically every code that calls

  • hasattr(dataset, attr)
  • foo = getattr(dataset, attr, None); if foo ...
  • try: dataset.foo ... except AttributeError: ...

with an unknown attribute will do the wrong thing. This is especially grave with sklearn, since sklearn will at one point check hasattr(X, 'loc') or hasattr(X, 'iloc') to determine if the input is a pandas DataFrame; obviously this will cause a lot of trouble.
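The behavior is easy to reproduce without torchtext. BrokenDataset below is a hypothetical minimal class that only mimics the generator-style __getattr__; it is not the torchtext source.

```python
class BrokenDataset:
    """Hypothetical minimal class mimicking a generator-style __getattr__."""

    def __init__(self, examples, fields):
        self.examples = examples
        self.fields = fields

    def __getattr__(self, attr):
        # The yield makes this a generator function, so calling it never
        # raises: it returns a generator object for *any* attribute name.
        if attr in self.fields:
            for example in self.examples:
                yield example[attr]


ds = BrokenDataset([{"text": "a duck"}, {"text": "a goose"}], fields={"text"})

print(list(ds.text))        # ['a duck', 'a goose']
print(hasattr(ds, "iloc"))  # True, although no such attribute exists
print(list(ds.shape))       # [], an empty generator

try:
    len(ds.shape)
except TypeError as exc:
    print(exc)              # object of type 'generator' has no len()
```

The last call reproduces the TypeError from the traceback above: the hasattr(x, 'shape') check passes, and len() is then applied to a generator.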

I tried to override their __getattr__ like this:

    def __getattr__(self, attr):
        if attr in self.fields:
            return [getattr(x, attr) for x in self.examples]
        else:
            raise AttributeError("no attribute", attr)

However, then I run into the next problem, namely these lines:

https://github.com/pytorch/text/blob/c57369cb1049b4ecb075f6f766494ed3842269d1/torchtext/data/field.py#L288-L289

They basically rely on the faulty __getattr__ behavior there. At that point, I gave up; who knows how many parts of their code are still affected by this.

So overall, I'm sorry to say that you might just not be able to combine RandomizedSearchCV with the torchtext example without some serious hacking. At least I see no easy fix. But for me, this is a problem on the torchtext side and I wouldn't want to implement any fixes on the skorch side. Perhaps you can compel the torchtext devs to fix the issue but it could be hard to do that.

@ToddMorrill

Thanks for that explanation @BenjaminBossan. I filed a bug with torchtext. Let's see if they pick it up.

@ToddMorrill

Quick update on this. torchtext is rolling out some new design patterns that more closely mirror torch.utils.data.

This describes their plans a bit more. I'm hoping in the long run this will make torchtext more seamlessly compatible with skorch and sklearn.

@BenjaminBossan
Collaborator

Thanks for reporting back. I read it but since I'm not familiar with torchtext, I can't really judge the changes. The general idea seems to be good. Whether it makes it easier to integrate with skorch will have to be seen.

@ToddMorrill do you have any experience with using the facilities provided by huggingface instead of torchtext? I wonder if those cooperate better with skorch. I think it could also be interesting to provide sklearn transformers that wrap their tokenizers, which would make it possible to integrate them into an sklearn pipeline.

@ToddMorrill

I haven't had a chance to use huggingface's tools, but it's on my current project's roadmap. I'll share if I get anything running.

@ToddMorrill

Hey @BenjaminBossan, quick question. Circling back to my comment above - would it be possible to use RandomizedSearchCV when y=None? I'm working on a little project that uses torch.utils.data.Dataset and torch.utils.data.DataLoader. Everything works fine with vanilla training (i.e. skorch_model.fit(train_dataset, y=None)) but when I try the same setup with search.fit(train_dataset, y=None) I got TypeError: fit() missing 1 required positional argument: 'y'. I can see that y=None is possible for unsupervised learning but naturally, my goal is supervised learning.

@BenjaminBossan
Collaborator

@ToddMorrill could you check whether one of these three proposals works for you?

  1. Pass a dummy value as y with the correct shape (might not work, depending on the metric)
  2. Extract your y value from your dataset (e.g. y = torch.cat([dataset[i][1] for i in range(len(dataset))]).numpy())
  3. Pass y=SliceDataset(dataset, idx=1), assuming that index 1 is your target (details)
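As a rough illustration of the second proposal, the sketch below pulls the targets out of a PyTorch dataset into a NumPy array. The `TensorDataset` and the helper name `extract_targets` are toy stand-ins invented for this example, not anything from the notebook discussed here. It uses `torch.stack` instead of `torch.cat` because `torch.cat` rejects zero-dimensional tensors, which is what each target is when the targets are scalars.

```python
import torch
from torch.utils.data import TensorDataset

# Toy stand-in for a real dataset: 6 samples, 4 features, binary targets.
X = torch.randn(6, 4)
y = torch.tensor([0, 1, 0, 1, 1, 0])
dataset = TensorDataset(X, y)


def extract_targets(ds, target_idx=1):
    """Collect the target of every sample into one NumPy array.

    Assumes ds[i] is a tuple and that position `target_idx` holds the target.
    """
    targets = [torch.as_tensor(ds[i][target_idx]) for i in range(len(ds))]
    # torch.stack works for scalar (0-dim) targets; torch.cat would not.
    return torch.stack(targets).numpy()


y_array = extract_targets(dataset)
print(y_array.shape)
```

This only covers how `y` is obtained. It does not by itself guarantee that the search accepts the dataset as `X`.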

@ToddMorrill

Good thoughts!

I tried all 3 techniques and you can see the example that I'm working on for the dask team here. There's a section in this notebook titled "Grid search with Skorch" where you'll see all 3 attempts that all resulted in ValueError: Dataset does not have consistent lengths.

@BenjaminBossan
Collaborator

Could you please paste the full stack trace for the error? I assume it's the same for all 3 cases?

@ToddMorrill

Indeed, the error and stack trace were the same for all 3 cases. Here it is.

Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/classifier.py", line 142, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 854, in fit
    self.partial_fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 813, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 717, in fit_loop
    X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1198, in get_split_datasets
    dataset = self.get_dataset(X, y)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
    return dataset(X, y, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/skorch/dataset.py", line 165, in __init__
    len_X = get_len(X)
  File "/opt/conda/lib/python3.7/site-packages/skorch/dataset.py", line 76, in get_len
    raise ValueError("Dataset does not have consistent lengths.")
ValueError: Dataset does not have consistent lengths.

  FitFailedWarning)
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: dropout, filter_sizes, n_filters, pretrained_embeddings.
Re-initializing optimizer.
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.6472       0.6690        0.5837  1.7154
      2        0.4745       0.8010        0.4465  1.5456

@BenjaminBossan
Collaborator

This is interesting, it looks like it works a few times and then suddenly it breaks.

Could you please initialize the net with skorch_model = NeuralNetClassifier(..., dataset=TorchDataset) and see if that works? This parameter determines what dataset is used for the skorch internal split and, as is, the skorch.dataset.Dataset is used, which is not what you want.

After trying that, regardless of whether it helps, please do the following: skorch_model = NeuralNetClassifier(..., train_split=False). This turns off the skorch internal train/valid split and should also prevent the error. Typically, when you perform a hyper-parameter search, you don't need the skorch internal split, since sklearn will already take care of splitting the data for you.

@ToddMorrill

FWIW, the default value for the refit parameter in RandomizedSearchCV is True, so I think the one success you're seeing might be the result of that. After setting refit=False that one success disappears.

This turns off the skorch internal train/valid split and should also prevent the error. Typically, when you perform a hyper-parameter search, you don't need the skorch internal split, since sklearn will already take care of splitting the data for you.

Makes sense, thanks for the insight.

Running with skorch_model = NeuralNetClassifier(..., dataset=TorchDataset) yields the error below for all 3 approaches outlined above, both with skorch_model = NeuralNetClassifier(..., train_split=skorch.dataset.CVSplit(.2)) and with skorch_model = NeuralNetClassifier(..., train_split=False):

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/classifier.py", line 142, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 854, in fit
    self.partial_fit(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 813, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 717, in fit_loop
    X, y, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1198, in get_split_datasets
    dataset = self.get_dataset(X, y)
  File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
    return dataset(X, y, **kwargs)
TypeError: __init__() takes 2 positional arguments but 3 were given

  FitFailedWarning)

Do you think this is because my custom TorchText class is only expecting 1 argument, namely train_dataset and not y?
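The TypeError itself can be reproduced without skorch. The sketch below assumes only what the traceback shows, namely that the dataset class is called as `dataset(X, y, **kwargs)`. The classes `OneArgDataset` and `TwoArgDataset` and the helper `build` are hypothetical names for illustration. A class whose `__init__` accepts a single argument fails on that call, while one that also accepts an optional `y` does not.

```python
class OneArgDataset:
    """Hypothetical custom dataset whose __init__ accepts only the data."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]


class TwoArgDataset(OneArgDataset):
    """Same dataset, but tolerant of an extra (possibly unused) y argument."""

    def __init__(self, data, y=None):
        super().__init__(data)
        self.y = y


def build(dataset_cls, X, y):
    # Same call shape as the `dataset(X, y, **kwargs)` line in the traceback.
    return dataset_cls(X, y)


samples = [([1, 12, 3], 0), ([6, 22], 1)]

try:
    build(OneArgDataset, samples, None)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
ds = build(TwoArgDataset, samples, None)
print(len(ds))
```

Accepting `y=None` in the custom class is one way to make the call succeed. Whether that alone is enough for the hyper-parameter search is a separate question.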

@BenjaminBossan
Collaborator

That's a bit strange:

File "/opt/conda/lib/python3.7/site-packages/skorch/net.py", line 1153, in get_dataset
return dataset(X, y, **kwargs)

This code path should never be reached because this line comes before it:

skorch/skorch/net.py

Lines 1154 to 1155 in 6fe94fd

    if is_dataset(X):
        return X

Could you maybe turn on the debugger and check the value of X at that point?

BenjaminBossan added a commit that referenced this issue Aug 30, 2020
This release of skorch contains a few minor improvements and some nice additions. As always, we fixed a few bugs and improved the documentation. Our [learning rate scheduler](https://skorch.readthedocs.io/en/latest/callbacks.html#skorch.callbacks.LRScheduler) now optionally logs learning rate changes to the history; moreover, it now allows the user to choose whether an update step should be made after each batch or each epoch.

If you always longed for a metric that would just use whatever is defined by your criterion, look no further than [`loss_scoring`](https://skorch.readthedocs.io/en/latest/scoring.html#skorch.scoring.loss_scoring). Also, skorch now allows you to easily change the kind of nonlinearity to apply to the module's output when `predict` and `predict_proba` are called, by passing the `predict_nonlinearity` argument.

Besides these changes, we improved the customization potential of skorch. First of all, the `criterion` is now set to `train` or `valid`, depending on the phase -- this is useful if the criterion should act differently during training and validation. Next we made it easier to add custom modules, optimizers, and criteria to your neural net; this should facilitate implementing architectures like GANs. Consult the [docs](https://skorch.readthedocs.io/en/latest/user/neuralnet.html#subclassing-neuralnet) for more on this. Conveniently, [`net.save_params`](https://skorch.readthedocs.io/en/latest/net.html#skorch.net.NeuralNet.save_params) can now persist arbitrary attributes, including those custom modules.
As always, these improvements wouldn't have been possible without the community. Please keep asking questions, raising issues, and proposing new features. We are especially grateful to those community members, old and new, who contributed via PRs:

```
Aaron Berk
guybuk
kqf
Michał Słapek
Scott Sievert
Yann Dubois
Zhao Meng
```

Here is the full list of all changes:

### Added

- Added the `event_name` argument for `LRScheduler` for optional recording of LR changes inside `net.history`. NOTE: Supported only in Pytorch>=1.4
- Make it easier to add custom modules or optimizers to a neural net class by automatically registering them where necessary and by making them available to set_params
- Added the `step_every` argument for `LRScheduler` to set whether the scheduler step should be taken on every epoch or on every batch.
- Added the `scoring` module with `loss_scoring` function, which computes the net's loss (using `get_loss`) on provided input data.
- Added a parameter `predict_nonlinearity` to `NeuralNet` which allows users to control the nonlinearity to be applied to the module output when calling `predict` and `predict_proba` (#637, #661)
- Added the possibility to save the criterion with `save_params` and with checkpoint callbacks
- Added the possibility to save custom modules with `save_params` and with checkpoint callbacks

### Changed

- Removed support for schedulers with a `batch_step()` method in `LRScheduler`.
- Raise `FutureWarning` in `CVSplit` when `random_state` is not used. Will raise an exception in a future release (#620)
- The behavior of method `net.get_params` changed to make it more consistent with sklearn: it will no longer return "learned" attributes like `module_`; therefore, functions like `sklearn.base.clone`, when called with a fitted net, will no longer return a fitted net but instead an uninitialized net; if you want a copy of a fitted net, use `copy.deepcopy` instead; `net.get_params` is used under the hood by many sklearn functions and classes, such as `GridSearchCV`, whose behavior may thus be affected by the change. (#521, #527)
- Raise `FutureWarning` when using `CyclicLR` scheduler, because the default behavior has changed from taking a step every batch to taking a step every epoch. (#626)
- Set train/validation on criterion if it's a PyTorch module (#621)
- Don't pass `y=None` to `NeuralNet.train_split` to enable the direct use of split functions without positional `y` in their signatures. This is useful when working with unsupervised data (#605).
- `to_numpy` is now able to unpack dicts and lists/tuples (#657, #658)
- When using `CrossEntropyLoss`, softmax is now automatically applied to the output when calling `predict` or `predict_proba`

### Fixed

- Fixed a bug where `CyclicLR` scheduler would update during both training and validation rather than just during training.
- Fixed a bug introduced by moving the `optimizer.zero_grad()` call outside of the train step function, making it incompatible with LBFGS and other optimizers that call the train step several times per batch (#636)
- Fixed pickling of the `ProgressBar` callback (#656)
@BenjaminBossan
Collaborator

@ToddMorrill any updates?

@BenjaminBossan
Copy link
Collaborator

Since there haven't been any updates for quite a while, I assume this has been resolved. Feel free to re-open if not.
