Dependencies
- TensorFlow (tested with v1.1)
- NLTK
Based on the TensorFlow implementation here, which is in turn based on the PTB LSTM implementation here.
Implements noising for neural language modeling as described in this paper.
@inproceedings{noising2017,
title={Data Noising as Smoothing in Neural Network Language Models},
author={Xie, Ziang and Wang, Sida I. and Li, Jiwei and L{\'e}vy, Daniel and Nie, Aiming and Jurafsky, Dan and Ng, Andrew Y.},
booktitle={International Conference on Learning Representations (ICLR)},
year={2017}
}
The noising code can be found in loader.py and utils.py.
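For orientation, here is a minimal sketch of the two simplest noising schemes from the paper, blank noising and unigram noising. It is not the code in loader.py or utils.py; the function names `blank_noise` and `unigram_noise`, the `"_"` placeholder, and the argument layout are assumptions made for illustration only.

```python
import random
from collections import Counter


def blank_noise(tokens, gamma, rng, blank="_"):
    """Replace each token with a placeholder with probability gamma."""
    return [blank if rng.random() < gamma else tok for tok in tokens]


def unigram_noise(tokens, gamma, unigram_counts, rng):
    """Replace each token, with probability gamma, by a draw from the
    unigram distribution estimated from unigram_counts."""
    vocab = list(unigram_counts)
    weights = [unigram_counts[w] for w in vocab]
    noised = []
    for tok in tokens:
        if rng.random() < gamma:
            # Sample a replacement in proportion to its corpus frequency.
            noised.append(rng.choices(vocab, weights=weights, k=1)[0])
        else:
            noised.append(tok)
    return noised


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the cat sat on the mat".split()
    counts = Counter(sentence)  # hypothetical stand-in for corpus counts
    print(blank_noise(sentence, 0.2, rng))
    print(unigram_noise(sentence, 0.2, counts, rng))
```

Both functions keep the sequence length unchanged, so the noised tokens can be fed to the language model in place of the original input tokens.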
First download the PTB data from here and put it in the data directory. Make sure to update the paths in cfg.py to point to the data.
Alternatively, you can grab the Text8 data here, then run the script data/text8/makedata-text8.sh.
Then run lm.py. Here's an example setting:
python lm.py --run_dir /tmp/lm_1500_kn --hidden_dim 1500 --drop_prob 0.65 --gamma 0.2 --scheme ngram --ngram_scheme kn --absolute_discounting
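The example command selects an n-gram scheme with Kneser-Ney-style options. As a rough sketch of what the paper's bigram variant computes, and assuming (not confirmed by this README) that `--absolute_discounting` and `--ngram_scheme kn` correspond to it: the noising probability for a history word is scaled by the fraction of distinct continuations it has, and replacement words are drawn in proportion to the number of distinct words that precede them. The helper names below are hypothetical and do not come from this repository.

```python
from collections import Counter, defaultdict


def bigram_stats(tokens):
    """Collect the bigram statistics needed for the sketch."""
    follow = defaultdict(set)    # distinct words seen after each word
    precede = defaultdict(set)   # distinct words seen before each word
    hist_count = Counter()       # number of bigrams starting with each word
    for prev, cur in zip(tokens, tokens[1:]):
        follow[prev].add(cur)
        precede[cur].add(prev)
        hist_count[prev] += 1
    return follow, precede, hist_count


def discounted_gamma(word, gamma0, follow, hist_count):
    """gamma0 * (distinct continuations of word) / (bigrams starting with word)."""
    total = hist_count[word]
    if total == 0:
        return 0.0  # unseen history: no noising in this sketch
    return gamma0 * len(follow[word]) / total


def kn_proposal(precede):
    """Proposal q(x) proportional to the number of distinct words preceding x."""
    total = sum(len(s) for s in precede.values())
    return {w: len(s) / total for w, s in precede.items()}


if __name__ == "__main__":
    corpus = "a b a c a b".split()
    follow, precede, hist_count = bigram_stats(corpus)
    print(discounted_gamma("a", 0.3, follow, hist_count))
    print(kn_proposal(precede))
```

Under this reading, words that appear in many different contexts are noised more often as histories and are chosen more often as replacements, which mirrors how Kneser-Ney smoothing distributes probability mass.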