diff --git a/scripts/data_prep/README.md b/scripts/data_prep/README.md
index 014174e579..0c809159ac 100644
--- a/scripts/data_prep/README.md
+++ b/scripts/data_prep/README.md
@@ -1,6 +1,6 @@
 # Data preparation
 
-This folder contains scripts for converting text data from original sources (HF, JSON) to the Mosaic [StreamingDataset](https://github.com/mosaicml/streaming) format for consumption by our training scripts.
+This folder contains scripts for converting text data from original sources (HF, JSON) to the Mosaic [StreamingDataset](https://github.com/mosaicml/streaming) format for consumption by our training scripts. StreamingDataset is designed to make training on large datasets from cloud storage as fast, cheap, and scalable as possible. In particular, it is custom built for multi-node, distributed training for large models while maximizing correctness guarantees, performance, and ease of use.
 
 ## Converting a pretraining dataset