The following steps prepare a Wikipedia corpus for pretraining. They can also be reused with little or no modification to preprocess other datasets:
- Download the Wikipedia dump from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This is a bzip2-compressed file (not a zip) and needs to be decompressed first, e.g. with `bunzip2 enwiki-latest-pages-articles.xml.bz2`.
- Clone [WikiExtractor](https://github.com/attardi/wikiextractor) and run:

  ```
  python wikiextractor/WikiExtractor.py -o /out -b 1000M enwiki-latest-pages-articles.xml
  ```

- Run:

  ```
  python single_line_doc_file_creation.py
  ```

  This script removes HTML tags and empty lines and writes everything to a single file in which each line is a paragraph (sketched below).

- Run:

  ```
  python sentence_segmentation.py <input_file> <output_file>
  ```

  This script converts `<input_file>` into a file in which each line is a sentence (sketched below).

- Split the resulting file into ~100 files by line with:

  ```
  python split_data_into_files.py
  ```

  (a sketch of this step also follows the list).

- From the current folder (`/pytorch/pretrain/dataprep`), run:

  ```
  python create_pretraining.py --input_dir=<input_directory> --output_dir=<output_directory> --do_lower_case=true
  ```

  This converts each input file into a pickled `.bin` file (sketched below).
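The preprocessing scripts above live in the repository and their sources aren't reproduced here, so the sketches below are rough illustrations of what each step does, not the scripts' actual contents. First, a minimal sketch of the paragraph-per-line step, assuming WikiExtractor's default `/out/<subdir>/wiki_*` output layout; the output filename `wiki_paragraphs.txt` is made up for illustration:

```python
import glob
import re

# Crude stand-in for single_line_doc_file_creation.py: strip the <doc ...>
# wrapper tags that WikiExtractor emits (and any other markup), drop empty
# lines, and merge everything into one file with one paragraph per line.
TAG_RE = re.compile(r"<[^>]+>")

with open("wiki_paragraphs.txt", "w", encoding="utf-8") as out:  # name assumed
    for path in sorted(glob.glob("/out/*/wiki_*")):  # WikiExtractor layout
        with open(path, encoding="utf-8") as f:
            for line in f:
                text = TAG_RE.sub("", line).strip()
                if text:                    # skip empty lines
                    out.write(text + "\n")  # one paragraph per line
```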
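Next, a sketch of the sentence-segmentation step, using NLTK's `sent_tokenize` as a stand-in segmenter (the real `sentence_segmentation.py` may use a different one):

```python
import sys

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)      # sentence model (older NLTK)
nltk.download("punkt_tab", quiet=True)  # model name used by newer NLTK

# Read one paragraph per line, write one sentence per line.
in_path, out_path = sys.argv[1], sys.argv[2]
with open(in_path, encoding="utf-8") as fin, \
        open(out_path, "w", encoding="utf-8") as fout:
    for paragraph in fin:
        for sentence in sent_tokenize(paragraph.strip()):
            fout.write(sentence + "\n")  # one sentence per line
```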
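A sketch of the ~100-way split, assuming the input filename `wiki_sentences.txt`; it writes contiguous chunks so neighboring sentences stay in the same shard, though the real `split_data_into_files.py` may split differently:

```python
import itertools

N_SHARDS = 100
IN_PATH = "wiki_sentences.txt"  # output of the previous step (name assumed)

# Count lines once, then write ~equal contiguous chunks so that sentences
# from the same article end up in the same shard.
with open(IN_PATH, encoding="utf-8") as f:
    total = sum(1 for _ in f)
per_shard = -(-total // N_SHARDS)  # ceiling division

with open(IN_PATH, encoding="utf-8") as f:
    for i in range(N_SHARDS):
        chunk = list(itertools.islice(f, per_shard))
        if not chunk:
            break
        with open(f"shard_{i:03d}.txt", "w", encoding="utf-8") as out:
            out.writelines(chunk)
```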
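Finally, a deliberately minimal sketch of only the outer shape of the last step: read each shard, optionally lowercase it, and pickle the result to a `.bin` file. The real `create_pretraining.py` also builds the actual BERT pretraining instances (tokenization, sentence pairing, masking), all of which is omitted here; only the flag names follow the command above:

```python
import argparse
import glob
import os
import pickle

parser = argparse.ArgumentParser()
parser.add_argument("--input_dir", required=True)
parser.add_argument("--output_dir", required=True)
parser.add_argument("--do_lower_case", default="true")
args = parser.parse_args()

lower = args.do_lower_case.lower() == "true"
os.makedirs(args.output_dir, exist_ok=True)
for path in sorted(glob.glob(os.path.join(args.input_dir, "*"))):
    with open(path, encoding="utf-8") as f:
        lines = [ln.strip().lower() if lower else ln.strip() for ln in f]
    out_name = os.path.basename(path) + ".bin"
    with open(os.path.join(args.output_dir, out_name), "wb") as out:
        pickle.dump(lines, out)  # the real script pickles training instances
```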