SNTSVV/wikidominer

WikiDoMiner: Wikipedia Domain-specific Miner

WikiDoMiner is a tool that automatically generates domain-specific corpora by crawling Wikipedia.

Installation

Clone the repository and install the required libraries:

git clone https://github.com/SNTSVV/WikiDoMiner.git
cd WikiDoMiner
pip install -r requirements.txt

Usage example

CLI:

python WikiDoMiner.py --doc Xfile.txt --output-path ../research/nlp --wiki-depth 1

List the available arguments with:

python WikiDoMiner.py --help
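
To process several documents, the CLI call above can be scripted. The following is a minimal sketch, assuming it is run from the cloned WikiDoMiner directory. `build_command` and `run_wikidominer` are hypothetical helper names that are not part of the tool, and only the flags shown above are used.

```python
import subprocess
import sys


def build_command(doc, output_path, wiki_depth=1, script="WikiDoMiner.py"):
    """Assemble the argument list for one WikiDoMiner run (hypothetical helper)."""
    return [
        sys.executable, str(script),
        "--doc", str(doc),
        "--output-path", str(output_path),
        "--wiki-depth", str(wiki_depth),
    ]


def run_wikidominer(doc, output_path, wiki_depth=1):
    """Run the CLI once; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(doc, output_path, wiki_depth), check=True)


# Show the command that would be executed for the example document.
print(" ".join(build_command("Xfile.txt", "../research/nlp", wiki_depth=1)[1:]))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.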

Notebook (Open in Colab):

# extract keywords
keywords = getKeywords(document, spacy_pipeline)

# query wikipedia to get your corpus
corpus = getCorpus(keywords, depth=1)

# locally save your corpus 
saveCorpus(corpus, parent_dir='Documents', folder='Corpus')
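
The `depth` argument (and `--wiki-depth` on the CLI) is not explained here. One plausible reading is that it limits how many levels of subcategories are followed when collecting articles. The sketch below illustrates that reading on a small made-up category graph. It is an assumption for illustration only, not the tool's actual traversal, and `collect_articles`, `SUBCATEGORIES`, and `ARTICLES` are hypothetical names.

```python
from collections import deque

# Toy stand-in for a category graph (made-up data, not real Wikipedia output).
SUBCATEGORIES = {
    "Rail transport": ["Railway signalling"],
    "Railway signalling": ["Train protection systems"],
    "Train protection systems": [],
}
ARTICLES = {
    "Rail transport": ["Train", "Track gauge"],
    "Railway signalling": ["Interlocking"],
    "Train protection systems": ["Automatic train stop"],
}


def collect_articles(start_categories, depth):
    """Breadth-first walk that descends at most `depth` subcategory levels."""
    start = list(dict.fromkeys(start_categories))  # drop duplicates, keep order
    seen = set(start)
    queue = deque((category, 0) for category in start)
    articles = []
    while queue:
        category, level = queue.popleft()
        for title in ARTICLES.get(category, []):
            if title not in articles:
                articles.append(title)
        if level < depth:  # only descend while the depth budget allows
            for sub in SUBCATEGORIES.get(category, []):
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, level + 1))
    return articles


print(collect_articles(["Rail transport"], depth=0))  # ['Train', 'Track gauge']
print(collect_articles(["Rail transport"], depth=1))  # ['Train', 'Track gauge', 'Interlocking']
```

Under this reading, a larger depth yields a broader corpus that is less focused on the original domain.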

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
