What should the vocab file be set to when running on new documents and questions #43

Open
murphp15 opened this issue Jun 1, 2018 · 3 comments

Comments

@murphp15

murphp15 commented Jun 1, 2018

I want to boot up a demo of this using the server.py file in this repo.
However, I'm not sure what I should set the vocab file to.
In the run_on_user_documents.py file I see you set it to all the words in the questions the client asks. However, if I don't know the words from the questions and documents up front, how should I handle this?

@chrisc36
Contributor

chrisc36 commented Jun 4, 2018

What I did for the demo was to pre-compute a file of the top-n words that occur in the TriviaQA corpus and use that.

You can set it to None, in which case it will just use all the words found in the word vector file. However, I think the code needs to be tweaked a little to do that, since there are an absurd number of words in the GloVe word vectors and there is a limit to how large constant tensors can be in the graph.
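
To make the size concern concrete, here is a rough back-of-the-envelope check. It assumes float32 embeddings and the 2 GB protocol-buffer limit that applies to a serialized TensorFlow graph. The vocabulary sizes and the 300-dimension figure are illustrative assumptions, not numbers taken from this repo, and the helper names are hypothetical.

```python
# Rough size check for an embedding matrix stored as a graph constant.
# Assumption: float32 values (4 bytes each) and a 2 GB serialization limit.
PROTOBUF_LIMIT_BYTES = 2 ** 31  # 2 GB


def embedding_bytes(num_words: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold a num_words x dim embedding matrix."""
    return num_words * dim * bytes_per_value


def fits_in_graph_constant(num_words: int, dim: int) -> bool:
    """True if the matrix is smaller than the assumed 2 GB limit."""
    return embedding_bytes(num_words, dim) < PROTOBUF_LIMIT_BYTES


# Illustrative numbers only: a full word-vector file with ~2.2M words
# versus a pruned top-n vocabulary of 100k words, both at 300 dimensions.
print(fits_in_graph_constant(2_200_000, 300))
print(fits_in_graph_constant(100_000, 300))
```

Under these assumptions the full vector file needs about 2.6 GB and does not fit, while a pruned top-n vocabulary is only around 120 MB.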

@murphp15
Author

murphp15 commented Jun 5, 2018

What script did you run to pre-compute this vocab file?

@chrisc36
Contributor

I don't have a script I can share at the moment, but it's a pretty simple task to iterate through the pre-tokenized files and count how often each word occurs.
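
For anyone landing here, a minimal sketch of that counting step might look like the following. It is not the script used for the demo. It assumes the pre-tokenized files are plain UTF-8 text with whitespace-separated tokens and that the vocab file is one word per line; both are assumptions to check against what the code in this repo actually reads. The function names `count_tokens` and `write_top_n` are hypothetical.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_tokens(paths: Iterable[Path]) -> Counter:
    """Count token occurrences across whitespace-tokenized text files."""
    counts: Counter = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Assumption: tokens are already separated by whitespace.
                counts.update(line.split())
    return counts


def write_top_n(counts: Counter, n: int, out_path: Path) -> None:
    """Write the n most frequent words to out_path, one word per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for word, _ in counts.most_common(n):
            f.write(word + "\n")
```

Usage would be something like `write_top_n(count_tokens(Path("tokenized").glob("*.txt")), 100_000, Path("vocab.txt"))`, where the directory name, the cutoff, and the output path are all placeholders to adapt.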
