Move BrownCorpus from word2vec to gensim.corpora #2956

piskvorky · 2020-09-24T09:35:23Z

Not a high-priority at all, but it'd be more sensible for such a tutorial/testing utility corpus to be implemented elsewhere - maybe under /test/ or some other data- or doc- related module – rather than in gensim.models.word2vec.

Originally posted by @gojomo in #2939 (comment)

The text was updated successfully, but these errors were encountered:

rishabhvarshney14 · 2020-09-24T11:04:45Z

Can I work on this?

piskvorky · 2020-09-24T11:09:04Z

Yes :)

We'll want to retain an alias in word2vec.py, so that people's code that relies on the current location continues to work. But the code itself should live under gensim.corpora.
Same with other such corpus classes that are currently under gensim.models – I believe there were several.

Thanks!

gojomo · 2020-09-24T17:21:45Z

The same goes for the Text8Corpus alongside it - if that class survives at all. Probably, LineSentence should be able to handle not just "more than X tokens" lines but also "all tokens on one line" - in which case, it can subsume what Text8Corpus was doing. (And then, maybe it lives in a more sensible place, like utils/corpora, instead of as a tag-on under word2vec?

rishabhvarshney14 · 2020-09-25T05:09:48Z

@piskvorky I am thinking about moving BrownCorpus to another file named browncorpus.py in corpora also there is TaggedBrownCorpus in doc2vec which I will move to browncorpus.py along with BrownCorpus and will do it for other utility corpus like Text8Corpus. Is this what should be done? Please correct me if I am wrong.

piskvorky · 2020-09-25T08:06:09Z

Yes, plus:

Add aliases to the old locations, so code that relies on the original location continues to work.
Make sure the new files have proper headers and docstrings.
Add .rst documentation for each new module here. You can use the same "stub" template as the other existing .rst files in that directory. This step ensures the new modules will appear in the public API documentation.

Do we have unit tests for those classes? If not, could you add them too please? Thanks.

gojomo · 2020-09-27T01:19:02Z

As these classes haven't had unit-tests, I wouldn't burden the simple housekeeping step (moving them to a more logical place) with the requirement to add unit tests for the 1st time.

What would it mean to unit-test a corpus, anyway? Check that it iterates to give an expected number of items/words? But what if that requires downloading extra data that isn't even in the standard gensim project? (I believe that's the case with both Brown and Text8.) Then, the "unit test" is actually adding new download code that wouldn't otherwise be present. It'd be slowing every CI test run with two new remote-downloads – which could (and almost certainly eventually will) fail due to matters entirely outside Gensim's control. And the code it's testing isn't very high-value – only used in a few demos. So I'd view creating unit texts for these classes as negative-value work: creating new costs/risks (slower tests that may generate spurious failures) for negligible value.

OTOH, if Text8Corpus's functionality could be folded into LineSentence, that new more-general corpus iterable would still feature in many tutorials/test-code. It could be (somewhat) meaningfully-tested with a small (self-contained) fragment of a long-lined corpus file (without claiming to actually be a unit-tested Text8Corpus).

piskvorky added good first issue Issue for new contributors (not required gensim understanding + very simple) housekeeping internal tasks and processes labels Sep 24, 2020

piskvorky mentioned this issue Sep 24, 2020

[MRG] *2Vec SaveLoad improvements #2939

Merged

rishabhvarshney14 mentioned this issue Oct 1, 2020

Move corpus classes/utilities to gensim.corpora #2970

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move BrownCorpus from word2vec to gensim.corpora #2956

Move BrownCorpus from word2vec to gensim.corpora #2956

piskvorky commented Sep 24, 2020

rishabhvarshney14 commented Sep 24, 2020

piskvorky commented Sep 24, 2020 •

edited

Loading

gojomo commented Sep 24, 2020

rishabhvarshney14 commented Sep 25, 2020

piskvorky commented Sep 25, 2020 •

edited

Loading

gojomo commented Sep 27, 2020

Move BrownCorpus from word2vec to gensim.corpora #2956

Move BrownCorpus from word2vec to gensim.corpora #2956

Comments

piskvorky commented Sep 24, 2020

rishabhvarshney14 commented Sep 24, 2020

piskvorky commented Sep 24, 2020 • edited Loading

gojomo commented Sep 24, 2020

rishabhvarshney14 commented Sep 25, 2020

piskvorky commented Sep 25, 2020 • edited Loading

gojomo commented Sep 27, 2020

piskvorky commented Sep 24, 2020 •

edited

Loading

piskvorky commented Sep 25, 2020 •

edited

Loading