Culling by least frequent words #1045
Comments
Hi Scott!
I may be answering your question too late at night, and I haven't looked at the [...]
The suggestion to use a negative number was a bit desperate, but it was prompted by the lack of screen real estate. Now that I think about it, perhaps the easiest UI solution to implement would be to replace "Use the top" with a select menu that also includes "Use the bottom". (Side issue: would "most frequent/least frequent" fit instead of "top/bottom"?) Possibly a switch or toggle button would do.
In other words, you can combine the UI controls, but I think culling based on most frequent and least frequent words together would be odd. Conceivably, you might choose this combined set as the "most distinctive" words in the corpus, but Top Words is really the "go to" tool for that. (If you really needed to, you could download the DTM from tokenise, compile a Keep Words list, and then scrub everything else out of your corpus to achieve the effect of culling everything but the most and least frequent words.)
I'm a little concerned by your second question because it seems to me that the same problem could potentially occur in the most frequent words scenario (although it is less likely). Perhaps this concern arises only because I have not looked at how the most frequent words function is implemented, but it may be that the function needs more rigorous testing. What exactly happens if you select the 25 most frequent words for each document and no term is shared by any two documents? Do you keep dipping into the documents until you compile 25 terms that are shared by however many documents you have specified? (This sounds like a nightmare of an algorithm to me, but maybe somebody has done it.) Or do you just return an error message stating that none of the top 25 terms in any of the documents was shared by N documents? I think the latter is what users will expect. I'd imagine a similar procedure for least frequent words.
I hope I've correctly understood your questions and provided at least some guidance.
I'll try to look at the code sometime tomorrow.
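To make the "return an error" behaviour discussed above concrete, here is a minimal sketch. It is not the existing implementation, which this thread does not show. It assumes each document is already a list of tokens, and the names `cull_terms`, `n`, `min_docs`, and `least_frequent` are hypothetical. Ties in frequency are broken alphabetically, which is also an assumption.

```python
from collections import Counter


def cull_terms(docs, n, min_docs, least_frequent=False):
    """Keep terms that rank among the n most (or least) frequent
    terms of at least min_docs documents.

    docs is a list of token lists. All names here are hypothetical.
    """
    if n <= 0 or min_docs <= 0:
        raise ValueError("n and min_docs must be positive")

    doc_hits = Counter()  # term -> number of documents whose slice contains it
    for tokens in docs:
        counts = Counter(tokens)
        # Sort by frequency (ascending for least frequent, descending
        # otherwise), then alphabetically so ties are deterministic.
        ranked = sorted(
            counts.items(),
            key=lambda kv: (kv[1] if least_frequent else -kv[1], kv[0]),
        )
        for term, _ in ranked[:n]:
            doc_hits[term] += 1

    kept = sorted(t for t, hits in doc_hits.items() if hits >= min_docs)
    if not kept:
        # Report the problem instead of dipping further into the documents.
        direction = "bottom" if least_frequent else "top"
        raise ValueError(
            f"None of the {direction} {n} terms in any document "
            f"is shared by {min_docs} documents."
        )
    return kept
```

With this shape, the least frequent case reuses the same code path as the most frequent case, so both fail in the same way when no term is shared.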
It would be really nice to create a feature that does the opposite of most frequent word culling so that people can study low-frequency words in their collections. Adding this to the back end should be pretty easy; the challenge will be the UI. I'm tempted to say that we just make the tooltip advise the user to enter a negative number (right now that generates a validation error).
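As a rough illustration of the negative-number idea, the signed value could be split into a direction and a count before validation. This is only a sketch; `parse_cull_count` is a hypothetical helper, not part of the current code.

```python
def parse_cull_count(raw):
    """Interpret a signed entry: positive means most frequent,
    negative means least frequent. Hypothetical helper."""
    value = int(raw.strip())  # raises ValueError for non-numeric input
    if value == 0:
        raise ValueError("Enter a non-zero whole number.")
    return ("least" if value < 0 else "most", abs(value))
```

Under this convention, an entry of `-25` would request the 25 least frequent words, and the existing validation would only need to reject zero and non-numeric input.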