# Data Preparation for BERT Pretraining

The following steps prepare the Wikipedia corpus for pretraining; they can be reused with little or no modification to preprocess other datasets as well (the full sequence is also sketched as a shell script after the list):

1. Download the Wikipedia dump file from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
   - This is a compressed `.bz2` file and needs to be decompressed first.
2. Clone WikiExtractor and run `python wikiextractor/WikiExtractor.py -o /out -b 1000M enwiki-latest-pages-articles.xml`.
3. Run `python single_line_doc_file_creation.py`. This script removes HTML tags and empty lines and writes the result to a single file in which each line is a paragraph.
4. Run `python sentence_segmentation.py <input_file> <output_file>`. This script converts `<input_file>` into a file in which each line is a sentence.
5. Split the output of the previous step into roughly 100 files, by line, with `python split_data_into_files.py`.
6. From the current folder (`/pytorch/pretrian/dataprep`), run `python create_pretraining.py --input_dir=<input_directory> --output_dir=<output_directory> --do_lower_case=true`, which converts each file into a pickled `.bin` file.
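
The individual commands above can be chained into a single driver script. The sketch below is a minimal, hedged example: the `wiki_paragraphs.txt` / `wiki_sentences.txt` file names, the `INPUT_DIR` / `OUTPUT_DIR` values, and the WikiExtractor clone URL are assumptions for illustration, not names required by the scripts in this folder, so check each script's expected inputs and outputs before running.

```bash
#!/usr/bin/env bash
# End-to-end sketch of the data-preparation steps above.
# Names marked "placeholder" are assumptions, not values dictated by the scripts.
set -euo pipefail

# Step 1: download and decompress the Wikipedia dump.
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
bunzip2 enwiki-latest-pages-articles.xml.bz2

# Step 2: extract article text with WikiExtractor
# (clone URL is the commonly used attardi/wikiextractor repository; adjust if needed).
git clone https://github.com/attardi/wikiextractor.git
python wikiextractor/WikiExtractor.py -o /out -b 1000M enwiki-latest-pages-articles.xml

# Step 3: collapse the extracted articles into one paragraph per line.
# The steps above run this script with no arguments, so it presumably
# resolves its input/output paths internally.
python single_line_doc_file_creation.py

# Step 4: one sentence per line (both file names are placeholders).
python sentence_segmentation.py wiki_paragraphs.txt wiki_sentences.txt

# Step 5: split the sentence-per-line file into ~100 smaller files.
python split_data_into_files.py

# Step 6: convert each split file into a pickled .bin file.
INPUT_DIR=./split_files      # placeholder
OUTPUT_DIR=./pretrain_bins   # placeholder
python create_pretraining.py \
  --input_dir="$INPUT_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --do_lower_case=true
```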