From 343c34a3464582652a828334cb52af247e249f08 Mon Sep 17 00:00:00 2001
From: jacobfulano <62222220+jacobfulano@users.noreply.github.com>
Date: Thu, 29 Jun 2023 12:51:17 -0400
Subject: [PATCH] Update README.md

---
 scripts/data_prep/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/data_prep/README.md b/scripts/data_prep/README.md
index 014174e579..0c809159ac 100644
--- a/scripts/data_prep/README.md
+++ b/scripts/data_prep/README.md
@@ -1,6 +1,6 @@
 # Data preparation
 
-This folder contains scripts for converting text data from original sources (HF, JSON) to the Mosaic [StreamingDataset](https://github.com/mosaicml/streaming) format for consumption by our training scripts.
+This folder contains scripts for converting text data from original sources (HF, JSON) to the Mosaic [StreamingDataset](https://github.com/mosaicml/streaming) format for consumption by our training scripts. StreamingDataset is designed to make training on large datasets from cloud storage as fast, cheap, and scalable as possible. In particular, it is custom built for multi-node, distributed training for large models while maximizing correctness guarantees, performance, and ease of use.
 
 ## Converting a pretraining dataset