-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move BrownCorpus from word2vec to gensim.corpora #2956
Comments
Can I work on this? |
Yes :) We'll want to retain an alias in Thanks! |
The same goes for the |
@piskvorky I am thinking about moving BrownCorpus to another file named browncorpus.py in corpora also there is TaggedBrownCorpus in doc2vec which I will move to browncorpus.py along with BrownCorpus and will do it for other utility corpus like Text8Corpus. Is this what should be done? Please correct me if I am wrong. |
Yes, plus:
Do we have unit tests for those classes? If not, could you add them too please? Thanks. |
As these classes haven't had unit-tests, I wouldn't burden the simple housekeeping step (moving them to a more logical place) with the requirement to add unit tests for the 1st time. What would it mean to unit-test a corpus, anyway? Check that it iterates to give an expected number of items/words? But what if that requires downloading extra data that isn't even in the standard gensim project? (I believe that's the case with both Brown and Text8.) Then, the "unit test" is actually adding new download code that wouldn't otherwise be present. It'd be slowing every CI test run with two new remote-downloads – which could (and almost certainly eventually will) fail due to matters entirely outside Gensim's control. And the code it's testing isn't very high-value – only used in a few demos. So I'd view creating unit texts for these classes as negative-value work: creating new costs/risks (slower tests that may generate spurious failures) for negligible value. OTOH, if Text8Corpus's functionality could be folded into LineSentence, that new more-general corpus iterable would still feature in many tutorials/test-code. It could be (somewhat) meaningfully-tested with a small (self-contained) fragment of a long-lined corpus file (without claiming to actually be a unit-tested |
Not a high-priority at all, but it'd be more sensible for such a tutorial/testing utility corpus to be implemented elsewhere - maybe under
/test/
or some other data- or doc- related module – rather than ingensim.models.word2vec
.Originally posted by @gojomo in #2939 (comment)
The text was updated successfully, but these errors were encountered: