Clean up train README #392

Merged
merged 32 commits on Jun 30, 2023
Changes from 15 commits

Commits (32)
fbe3f3f
Update README.md
jacobfulano Jun 29, 2023
6094d28
Correct typo in README.md
jacobfulano Jun 29, 2023
9391e47
Update README.md
jacobfulano Jun 29, 2023
b6d450b
Update README.md
jacobfulano Jun 29, 2023
343c34a
Update README.md
jacobfulano Jun 29, 2023
1930a09
Update README.md
jacobfulano Jun 29, 2023
a935d87
Update README.md
jacobfulano Jun 29, 2023
ca59598
Update README.md
jacobfulano Jun 29, 2023
d53ed42
Add intro to README.md
jacobfulano Jun 29, 2023
1f40c89
Add TOC to train README.md
jacobfulano Jun 29, 2023
4987875
Update README.md
jacobfulano Jun 29, 2023
beb7249
Update README.md
jacobfulano Jun 29, 2023
5c7c227
Update README.md
jacobfulano Jun 29, 2023
c046599
Update TOC in train README.md
jacobfulano Jun 29, 2023
717103a
Merge branch 'main' into jacobfulano-patch-1-1
jacobfulano Jun 29, 2023
647d45d
Update scripts/train/README.md
vchiley Jun 30, 2023
ccea953
Merge branch 'main' into jacobfulano-patch-1-1
vchiley Jun 30, 2023
5d9b0c4
Update README.md
jacobfulano Jun 30, 2023
3d6931f
add comment to benchmarking readme
jacobfulano Jun 30, 2023
a04a6c8
Update README.md
jacobfulano Jun 30, 2023
be714a6
Update README.md
jacobfulano Jun 30, 2023
0e35067
Update README.md
jacobfulano Jun 30, 2023
a2065fa
Update README.md
jacobfulano Jun 30, 2023
63f452c
Update README.md
jacobfulano Jun 30, 2023
f872846
Update README.md
jacobfulano Jun 30, 2023
64dfee3
Merge branch 'main' into jacobfulano-patch-1-1
jacobfulano Jun 30, 2023
e16e484
Apply suggestions from code review
vchiley Jun 30, 2023
ae2c5c5
Update scripts/train/README.md
vchiley Jun 30, 2023
8d7e878
lint
vchiley Jun 30, 2023
352f205
Update scripts/train/README.md
vchiley Jun 30, 2023
2909808
Update scripts/train/README.md
vchiley Jun 30, 2023
d528308
Merge branch 'jacobfulano-patch-1-1' of https://github.com/mosaicml/l…
vchiley Jun 30, 2023
2 changes: 1 addition & 1 deletion scripts/data_prep/README.md
@@ -1,6 +1,6 @@
# Data preparation

This folder contains scripts for converting text data from original sources (HF, JSON) to [StreamingDataset](https://github.com/mosaicml/streaming) format for consumption by our training scripts.
This folder contains scripts for converting text data from original sources (HF, JSON) to the Mosaic [StreamingDataset](https://github.com/mosaicml/streaming) format for consumption by our training scripts. StreamingDataset is designed to make training on large datasets from cloud storage as fast, cheap, and scalable as possible. In particular, it is custom built for multi-node, distributed training for large models while maximizing correctness guarantees, performance, and ease of use.
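For orientation, a conversion run typically looks something like the sketch below. This is an illustration only, not a command from this README: the flag values (dataset name, output path, tokenizer, and sequence length) are assumptions for a small C4 profiling run, so check each script's `--help` for the authoritative options.

```bash
# Sketch only: convert the small C4 splits from the HuggingFace Hub into
# StreamingDataset (.mds) shards under ./my-copy-c4.
# Flag values are assumptions; see the script's --help for the full option list.
python convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root ./my-copy-c4 \
  --splits train_small val_small \
  --concat_tokens 2048 \
  --tokenizer EleutherAI/gpt-neox-20b \
  --eos_text '<|endoftext|>'
```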


## Converting a pretraining dataset
50 changes: 34 additions & 16 deletions scripts/train/README.md
@@ -1,11 +1,25 @@
# LLM Pretraining
# LLM Pretraining <a name="llmpretraining"></a>

## Installation
The following tutorial walks through pretraining and finetuning a large language model using MosaicML's StreamingDataset format, Composer trainer, and MPT architecture. When used in concert on high-performance hardware such as A100 GPUs, these tools enable incredibly efficient and optimized LLM training.

#### Table of Contents
1. [LLM Pretraining](#llmpretraining)
1. [Installation](#installation)
2. [Dataset Preparation](#datasetpreparation)
3. [How to start single and multi-node pretraining](#howtostartpretraining)
2. [LLM Finetuning](#llmfinetuning)
1. [Using a dataset on the HuggingFace Hub](#hfdataset)
2. [Using a local dataset](#localdataset)
3. [Using a StreamingDataset (MDS) formatted dataset locally or in an object store](#mdsdataset)
3. [How many GPUs do I need to train an LLM?](#howmanygpus)
4. [Optimizing Performance](#optimizingperformance)

## Installation <a name="installation"></a>

If you haven't already, make sure to [install the requirements](../../README.md#Installation).

## Dataset preparation
To run pretraining, you'll need to make yourself a copy of a pretraining dataset. Check out the `llm-foundry/data_prep` folder for detailed instructions.
## Dataset preparation <a name="datasetpreparation"></a>
To run pretraining, you'll need to make yourself a copy of a pretraining dataset and format it for efficient streaming. Check out the `llm-foundry/data_prep` folder for detailed instructions on how to convert your dataset to the MosaicML [StreamingDataset](https://github.com/mosaicml/streaming) format.

As a quickstart, here is how to prepare the [C4: Colossal, Cleaned, Common Crawl dataset](https://huggingface.co/datasets/c4).
We first convert the dataset from its native format (a collection of zipped JSONs)
@@ -17,8 +31,11 @@ You can read more about the benefits of using mosaicml-streaming [here](https://
NOTE: If you only want to profile these LLMs, we recommend that you **download and prepare the `train_small` and `val_small` splits**,
and skip the full `train` and `val` splits. You'll just need to replace `split: train` with `split: train_small`
and `split: val` with `split: val_small` in your run YAML's dataloader config.
You can also accomplish this in your CLI command like so: `composer train.py ... train_loader.dataset.split=train_small eval_loader.dataset.split=val_small`
Alternatively, feel free to substitute our dataloader with one of your own in `train.py`.
You can also accomplish this in your CLI command like so:
```bash
composer train.py ... train_loader.dataset.split=train_small eval_loader.dataset.split=val_small
```
where the `composer` command used above to train the model refers to [Composer library's](https://github.com/mosaicml/composer) distributed launcher. Alternatively, feel free to substitute our dataloader with one of your own in `train.py`.

### Converting C4 to streaming dataset `.mds` format
To make yourself a copy of C4, use `convert_dataset_hf.py` like so:
@@ -56,7 +73,7 @@ python ../../llmfoundry/data/text_data.py --local_path /tmp/cache-c4 --remote_pa
# python ../data_prep/text_data.py --local_path /tmp/cache-c4 --remote_path s3://my-bucket/my-copy-c4 # stream from object store
```

## How to start training
## How to start single and multi-node pretraining <a name="howtostartpretraining"></a>

Now that you've installed dependencies and built a local copy of the C4 dataset, let's start training!

@@ -77,7 +94,7 @@ If training on a single node, the `composer` launcher will autodetect the number
composer train.py yamls/pretrain/mpt-125m.yaml train_loader.dataset.split=train_small eval_loader.dataset.split=val_small
```

To train with high performance on multi-node clusters, the easiest way is with the MosaicML platform ;) Check out the `mcli/` folder for examples!
To train with high performance on multi-node clusters, the easiest way is with the [MosaicML platform](https://www.mosaicml.com/training) ;) Check out the `mcli/` folder for examples!

But if you really must try this manually on your own cluster, then just provide a few variables to `composer`
either directly via CLI, or via environment variables that can be read. Then launch the appropriate command on each node:
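The full multi-node example is collapsed in this diff view; purely as an illustration (the world size, node count, master address, and port below are assumptions, not values taken from this README), a manual two-node launch follows this shape:

```bash
# Illustrative sketch only -- world size, ranks, address, and port are assumptions.
# Run one command per node; the Composer launcher coordinates the 16 total ranks.

# Node 0 (also hosts the rendezvous endpoint)
composer --world_size 16 --node_rank 0 \
  --master_addr 10.0.0.1 --master_port 7501 \
  train.py yamls/pretrain/mpt-125m.yaml

# Node 1
composer --world_size 16 --node_rank 1 \
  --master_addr 10.0.0.1 --master_port 7501 \
  train.py yamls/pretrain/mpt-125m.yaml
```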
@@ -146,7 +163,7 @@ by using [Composer's logging integrations](https://docs.mosaicml.com/projects/co
```


# LLM Finetuning
# LLM Finetuning <a name="llmfinetuning"></a>

This repo also contains utilities for Seq2Seq finetuning for LLMs, for example, Supervised Finetuning (SFT) (aka Instruction(Fine)Tuning (IFT)), or finetuning a base LLM to focus on a specific task like summarization.

@@ -155,7 +172,7 @@ If you are unfamiliar with that script, or the LLM-Foundry in general, you shoul

## If you want to finetune MPT-7B

You should probably start with ``yamls/finetune/mpt-7b_dolly_sft.yaml`
You should probably start with `yamls/finetune/mpt-7b_dolly_sft.yaml`
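Assuming the same `composer` launcher workflow described in the pretraining section above (and a dataset you have already prepared), launching that recipe is a one-liner; this is a sketch, not a tested command:

```bash
# Sketch: launch the MPT-7B Dolly SFT recipe with the Composer launcher,
# reusing the single-node invocation pattern shown for pretraining above.
composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml
```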

## Data formatting

@@ -231,7 +248,8 @@ For this example, let's say we add this function to a file that we can import fr
## Usage

Now we'll cover the different ways you can use the finetuning utilities. This will mostly focus on how to configure your YAML, assuming you have already prepared any custom preprocessing functions as described above.
### **1) Using a dataset on the HuggingFace Hub**

### **1) Using a dataset on the HuggingFace Hub** <a name="hfdataset"></a>

Let's say you want to finetune using a dataset available on the HuggingFace Hub.
If the dataset has a [pre-defined preprocessing function](#pre-defined-preprocessing-functions), e.g., `tatsu-lab/alpaca`, or if the dataset already has the "prompt"/"response" format, simply point the dataloader to that dataset.
@@ -255,7 +273,7 @@ train_loader:
...
```

### **2) Using a local dataset**
### **2) Using a local dataset** <a name="localdataset"></a>

Let's say you have your finetuning dataset stored in local `jsonl` files.
Reference this in your YAML, such as the one in `yamls/finetune/1b_local_data_sft.yaml`
@@ -272,7 +290,7 @@ train_loader:
```
As before, if your local dataset already has the "prompt"/"response" format, you don't need to include `preprocessing_fn` since no preprocessing is needed.
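As a concrete (hypothetical) illustration of that "prompt"/"response" format, each line of the `jsonl` file is a standalone JSON object; the path and record below are made up for illustration:

```bash
# Hypothetical example record -- the path and contents are illustrative only.
mkdir -p /tmp/my-finetune-data
cat > /tmp/my-finetune-data/train.jsonl <<'EOF'
{"prompt": "Summarize the following article:\n...", "response": "The article explains ..."}
EOF
head -n 1 /tmp/my-finetune-data/train.jsonl
```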

### **3) Using an MDS-formatted (streaming) dataset -- locally or in an object store**
### **3) Using a StreamingDataset (MDS) formatted dataset locally or in an object store** <a name="mdsdataset"></a>

To enable streaming, you must first use the `convert_finetuning_dataset.py` script to convert a HuggingFace dataset into an [MDS-formatted dataset](https://github.com/mosaicml/streaming) (which you totally should -- they're amazing).
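For reference, an invocation might look roughly like the sketch below; the flag names and values here are assumptions modeled on the other conversion scripts in this repo, so consult `convert_finetuning_dataset.py --help` before relying on them:

```bash
# Rough sketch only -- flag names and values are assumptions; check the script's --help.
python ../data_prep/convert_finetuning_dataset.py \
  --dataset tatsu-lab/alpaca \
  --splits train \
  --out_root s3://my-bucket/my-alpaca-mds
```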

@@ -307,7 +325,7 @@ train_loader:
```


# How many GPUs do I need to train a LLM?
# How many GPUs do I need to train an LLM? <a name="howmanygpus"></a>
This is a complicated question in general, but if we assume that you are using FSDP with `FULL_SHARD`,
activation checkpointing, and `DecoupledLionW`, then a good rule of thumb is:

@@ -324,7 +342,7 @@ if you use a larger cluster or devices with higher memory capacity, because this

Check out our [scripts/train/benchmarking folder](./benchmarking/README.md) for detailed throughput measurements of specific model sizes on specific cluster configs!

# Optimizing Performance
# Optimizing Performance <a name="optimizingperformance"></a>
The YAMLs in this repo are relatively well tuned for medium-to-large NVIDIA A100-40GB clusters.

If you are running with a CUDA-compatible GPU and have installed the LLM requirements, we turn on by default a kernel fusion optimization for the Cross Entropy loss function at the end of the model.
@@ -341,7 +359,7 @@ so you should be able to run the exact same YAML on 8 or 16 or 256 GPUs and get
This is nice because it means you can write device-count-agnostic training configs,
and not worry about OOM-ing or accidentally changing the optimization math.

In previous blogs ([1](https://www.mosaicml.com/blog/farewell-oom), [2](https://www.mosaicml.com/blog/billion-parameter-gpt-training-made-easy))
In previous blogposts ([1](https://www.mosaicml.com/blog/farewell-oom), [2](https://www.mosaicml.com/blog/billion-parameter-gpt-training-made-easy))
we also demonstrated auto microbatching, which takes things a step further by letting Composer determine the `device_train_microbatch_size` on its own.
This makes our configs not only device-count-agnostic, but hardware-agnostic too!
You can try out this feature by setting `device_train_microbatch_size: auto`, but bear in mind that FSDP support is still in alpha mode