Eval Quickstart #398

Merged: 10 commits, Jun 30, 2023
5 changes: 4 additions & 1 deletion README.md
@@ -191,7 +191,7 @@ python inference/convert_composer_to_hf.py \
# Evaluate the model on a subset of tasks
python eval/eval.py \
eval/yamls/hf_eval.yaml \
icl_tasks=eval/yamls/tasks_light.yaml \
icl_tasks=eval/yamls/copa.yaml \
model_name_or_path=mpt-125m-hf

# Generate responses to prompts
@@ -206,16 +206,19 @@ python inference/hf_generate.py \
Note: the `composer` command used above to train the model refers to the [Composer](https://github.com/mosaicml/composer) library's distributed launcher.

If you have a write-enabled [HuggingFace auth token](https://huggingface.co/docs/hub/security-tokens), you can optionally upload your model to the Hub! Just export your token like this:

```bash
export HUGGING_FACE_HUB_TOKEN=your-auth-token
```

and uncomment the line containing `--hf_repo_for_upload ...` in the above call to `inference/convert_composer_to_hf.py`.

# Learn more about LLM Foundry!

Check out [TUTORIAL.md](https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md) to keep learning about working with LLM Foundry. The tutorial highlights example workflows, points you to other resources throughout the repo, and answers frequently asked questions!

# Contact Us

If you run into any problems with the code, please file Github issues directly to this repo.

If you want to train LLMs on the MosaicML platform, reach out to us at [[email protected]](mailto:[email protected])!
39 changes: 32 additions & 7 deletions scripts/eval/README.md
@@ -1,9 +1,34 @@
# In-context learning (ICL) evaluaton
This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluaton suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include collection of ICL datasets we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/MODEL_GAUNTLET.md) organized into 6 broad categories of competency that we expect good foundation models to have.
# In-context learning (ICL) evaluation

You can evaluate a model by preparing an evaluaton YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).
This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility, supporting any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include a collection of ICL datasets, which we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/MODEL_GAUNTLET.md), organized into 6 broad categories of competency that we expect good foundation models to have.

You can evaluate a model by preparing an evaluation YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).

----

## Quickstart

To run a full evaluation on a model, install this repo and then run the following commands:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml
```

This will run a large eval suite, including our Model Gauntlet, on `EleutherAI/gpt-neo-125m`. You can update the model in that YAML file, or create your own, or override the values in the YAML with CLI args, such as:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml \
model_name_or_path=mosaicml/mpt-7b
```

----

## Offline evaluation

**Offline evaluation**
You can run the evaluation script on a model checkpoint via `composer eval/eval.py YOUR_YAML` from the `scripts` directory, or launch it on the MosaicML platform using an MCLI YAML following the format of [`llm-foundry/mcli/mcli-1b-eval.yaml`](https://github.com/mosaicml/llm-foundry/blob/main/mcli/mcli-1b-eval.yaml).
Your YAML must have a config section entitled `icl_tasks`; this can either be a list of dictionaries of the form

@@ -27,17 +52,17 @@ icl_tasks:
or a local path pointing to a YAML containing an icl\_tasks config.
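
For instance, a minimal list entry might look like the following sketch (it mirrors the task definitions shipped in `scripts/eval/yamls`; the dataset path is illustrative):

```yaml
icl_tasks:
-
  label: lambada_openai
  dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl  # path to a local JSONL dataset
  num_fewshot: [0]  # evaluate with zero in-context examples
  icl_task_type: language_modeling
```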

Note that if `continuation_delimiter`, `example_delimiter`, or `prompt_string` are omitted, they will default to the values below:

```yaml
continuation_delimiter: ' '
example_delimiter: "\n"
prompt_string: ''
```
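
To override one of these defaults, set the field directly on the task entry. A sketch, following the multiple-choice tasks in `tasks_light.yaml` in this repo:

```yaml
icl_tasks:
-
  label: piqa
  dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
  num_fewshot: [10]
  icl_task_type: multiple_choice
  continuation_delimiter: "\nAnswer: "  # replaces the default ' ' so questions and answers are clearly separated
```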

## Evaluation during training

**Evaluation during training**
You can also add ICL evaluation to your training runs by adding an `icl_tasks` config to your training config at the same depth as the `model` subconfig.
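
A minimal sketch, assuming a training YAML in the style of the configs under `scripts/train/yamls` (the model and dataset values below are illustrative):

```yaml
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: EleutherAI/gpt-neo-125m

# `icl_tasks` sits at the same depth as `model`
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
```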


----

## ICL Tasks
@@ -59,7 +84,7 @@ Composer currently supports four ICL formats
3. [InContextLearningMultipleChoiceTaskDataset](https://github.com/mosaicml/composer/blob/v0.14.0/composer/datasets/in_context_learning_evaluation.py#L405-L599)
4. [InContextLearningSchemaTaskDataset](https://github.com/mosaicml/composer/blob/v0.14.0/composer/datasets/in_context_learning_evaluation.py#L602-L773)

--------
----

### InContextLearningQATaskDataset

5 changes: 4 additions & 1 deletion scripts/eval/eval.py
@@ -5,6 +5,7 @@
import re
import sys
import time
import traceback
from typing import List

import pandas as pd
@@ -166,8 +167,10 @@ def main(cfg):
print(models_df.to_markdown(index=False))
except Exception as e:
print(
f'Got exception: {str(e)} while evaluating {model_cfg}. Continuing to next model.',
f'Got exception: {str(e)} while evaluating {model_cfg}. Traceback:',
flush=True)
traceback.print_exc()  # print the full traceback (goes to stderr by default)
print('\nContinuing to next model.\n', flush=True)


def calculate_markdown_results(logger_keys, logger_data, benchmark_to_taxonomy,
6 changes: 6 additions & 0 deletions scripts/eval/yamls/copa.yaml
@@ -0,0 +1,6 @@
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
22 changes: 18 additions & 4 deletions scripts/eval/yamls/hf_eval.yaml
@@ -2,21 +2,35 @@ max_seq_len: 2048
seed: 1
precision: amp_fp16

# If you are using one model, put it here:
model_name_or_path: EleutherAI/gpt-neo-125m
# otherwise, write a block for each model you want to test in the `models` section

models:
-
model_name: EleutherAI/gpt-neo-125m
model_name: ${model_name_or_path}
model:
name: hf_causal_lm
pretrained_model_name_or_path: EleutherAI/gpt-neo-125m
pretrained_model_name_or_path: ${model_name_or_path}
init_device: cpu
pretrained: true
tokenizer:
name: EleutherAI/gpt-neo-125m
name: ${model_name_or_path}
kwargs:
model_max_length: ${max_seq_len}
# # if you are evaluating more than one model, list them all as YAML blocks without variable interpolation
# -
# model_name: mosaicml/mpt-7b
# model:
# name: hf_causal_lm
# pretrained_model_name_or_path: mosaicml/mpt-7b
# init_device: cpu
# pretrained: true
# tokenizer:
# name: mosaicml/mpt-7b
# kwargs:
# model_max_length: ${max_seq_len}

# load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4

20 changes: 10 additions & 10 deletions scripts/eval/yamls/tasks_light.yaml
@@ -1,40 +1,40 @@
icl_tasks:
-
label: lambada_openai
dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl
dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl # or use your own dataset URI
num_fewshot: [0]
icl_task_type: language_modeling
-
label: piqa
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: hellaswag
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: arc_easy
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
continuation_delimiter: "\nAnswer: "
-
label: arc_challenge
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
continuation_delimiter: "\nAnswer: "
-
label: copa
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
num_fewshot: [0]
icl_task_type: multiple_choice
-
label: boolq
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
continuation_delimiter: "\nAnswer: "