Eval Quickstart #398

Merged: 10 commits, Jun 30, 2023
5 changes: 4 additions & 1 deletion README.md
@@ -191,7 +191,7 @@ python inference/convert_composer_to_hf.py \
# Evaluate the model on a subset of tasks
python eval/eval.py \
eval/yamls/hf_eval.yaml \
icl_tasks=eval/yamls/tasks_light.yaml \
icl_tasks=eval/yamls/copa.yaml \
model_name_or_path=mpt-125m-hf

# Generate responses to prompts
@@ -206,16 +206,19 @@ python inference/hf_generate.py \
Note: the `composer` command used above to train the model refers to the [Composer](https://github.com/mosaicml/composer) library's distributed launcher.

If you have a write-enabled [HuggingFace auth token](https://huggingface.co/docs/hub/security-tokens), you can optionally upload your model to the Hub! Just export your token like this:

```bash
export HUGGING_FACE_HUB_TOKEN=your-auth-token
```

and uncomment the line containing `--hf_repo_for_upload ...` in the above call to `inference/convert_composer_to_hf.py`.

# Learn more about LLM Foundry!

Check out [TUTORIAL.md](https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md) to keep learning about working with LLM Foundry. The tutorial highlights example workflows, points you to other resources throughout the repo, and answers frequently asked questions!

# Contact Us

If you run into any problems with the code, please file Github issues directly to this repo.

If you want to train LLMs on the MosaicML platform, reach out to us at [[email protected]](mailto:[email protected])!
39 changes: 32 additions & 7 deletions scripts/eval/README.md
@@ -1,9 +1,34 @@
# In-context learning (ICL) evaluaton
This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluaton suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility with any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include collection of ICL datasets we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/MODEL_GAUNTLET.md) organized into 6 broad categories of competency that we expect good foundation models to have.
# In-context learning (ICL) evaluation

You can evaluate a model by preparing an evaluaton YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).
This folder contains the MosaicML LLM evaluation suite. It is a [blazingly fast](https://www.mosaicml.com/blog/llm-evaluation-for-icl), multi-GPU-enabled ICL evaluation suite with native [FSDP](https://pytorch.org/docs/stable/fsdp.html) compatibility, supporting any model on the HuggingFace hub and any PyTorch model that implements the [`ComposerModel` interface](https://docs.mosaicml.com/projects/composer/en/latest/api_reference/generated/composer.ComposerModel.html#composermodel). We also include a collection of ICL datasets, which we refer to as our [Model Gauntlet](https://github.com/mosaicml/llm-foundry/blob/scripts/eval/local_data/MODEL_GAUNTLET.md), organized into 6 broad categories of competency that we expect good foundation models to have.

You can evaluate a model by preparing an evaluation YAML following the format of the examples in the [`scripts/eval/yamls` directory](https://github.com/mosaicml/llm-foundry/tree/main/scripts/eval/yamls).

----

## Quickstart

To run a full evaluation on a model, install this repo and then run the following commands:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml
```

This will run a large eval suite, including our Model Gauntlet, on `EleutherAI/gpt-neo-125m`. You can update the model in that YAML file, or create your own, or override the values in the YAML with CLI args, such as:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml \
model_name_or_path=mosaicml/mpt-7b
```

----

## Offline evaluation

**Offline evaluation**
You can run the evaluation script on a model checkpoint via `composer eval/eval.py YOUR_YAML` from the `scripts` directory, or launch it on the MosaicML platform using an MCLI YAML following the format of [`llm-foundry/mcli/mcli-1b-eval.yaml`](https://github.com/mosaicml/llm-foundry/blob/main/mcli/mcli-1b-eval.yaml).
Your YAML must have a config section entitled `icl_tasks`; this can either be a list of dictionaries of the form

@@ -27,17 +52,17 @@ icl_tasks:
or a local path pointing to a YAML containing an icl\_tasks config.
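
For instance, a minimal list entry might look like the following sketch (it mirrors the task definitions shipped in `scripts/eval/yamls`; the dataset path is illustrative):

```yaml
icl_tasks:
-
  label: lambada_openai
  dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl  # path to a local JSONL dataset
  num_fewshot: [0]  # evaluate with zero in-context examples
  icl_task_type: language_modeling
```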

Note that if `continuation_delimiter`, `example_delimiter`, or `prompt_string` are omitted, they will default to the values below:

```yaml
continuation_delimiter: ' '
example_delimiter: "\n"
prompt_string: ''
```
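
To override one of these defaults, set the field directly on the task entry. A sketch, following the multiple-choice tasks in `tasks_light.yaml` in this repo:

```yaml
icl_tasks:
-
  label: piqa
  dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
  num_fewshot: [10]
  icl_task_type: multiple_choice
  continuation_delimiter: "\nAnswer: "  # replaces the default ' ' so questions and answers are clearly separated
```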

## Evaluation during training

**Evaluation during training**
You can also add ICL evaluation to your training runs by adding an `icl_tasks` config to your training config at the same depth as the `model` subconfig.
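
A minimal sketch, assuming a training YAML in the style of the configs under `scripts/train/yamls` (the model and dataset values below are illustrative):

```yaml
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: EleutherAI/gpt-neo-125m

# `icl_tasks` sits at the same depth as `model`
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
```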


----

## ICL Tasks
@@ -59,7 +84,7 @@ Composer currently supports four ICL formats
3. [InContextLearningMultipleChoiceTaskDataset](https://github.com/mosaicml/composer/blob/v0.14.0/composer/datasets/in_context_learning_evaluation.py#L405-L599)
4. [InContextLearningSchemaTaskDataset](https://github.com/mosaicml/composer/blob/v0.14.0/composer/datasets/in_context_learning_evaluation.py#L602-L773)

--------
----

### InContextLearningQATaskDataset

5 changes: 4 additions & 1 deletion scripts/eval/eval.py
@@ -5,6 +5,7 @@
import re
import sys
import time
import traceback
from typing import List

import pandas as pd
@@ -166,8 +167,10 @@ def main(cfg):
print(models_df.to_markdown(index=False))
except Exception as e:
print(
f'Got exception: {str(e)} while evaluating {model_cfg}. Continuing to next model.',
f'Got exception: {str(e)} while evaluating {model_cfg}. Traceback:',
flush=True)
traceback.print_exc()  # print the full traceback (goes to stderr by default)
print('\nContinuing to next model.\n', flush=True)


def calculate_markdown_results(logger_keys, logger_data, benchmark_to_taxonomy,
6 changes: 6 additions & 0 deletions scripts/eval/yamls/copa.yaml
@@ -0,0 +1,6 @@
icl_tasks:
-
  label: copa
  dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
  num_fewshot: [0]
  icl_task_type: multiple_choice
22 changes: 18 additions & 4 deletions scripts/eval/yamls/hf_eval.yaml
@@ -2,21 +2,35 @@ max_seq_len: 2048
seed: 1
precision: amp_fp16

# If you are using one model, put it here:
model_name_or_path: EleutherAI/gpt-neo-125m
# otherwise, write a block for each model you want to test in the `models` section

models:
-
model_name: EleutherAI/gpt-neo-125m
model_name: ${model_name_or_path}
model:
name: hf_causal_lm
pretrained_model_name_or_path: EleutherAI/gpt-neo-125m
pretrained_model_name_or_path: ${model_name_or_path}
init_device: cpu
pretrained: true
tokenizer:
name: EleutherAI/gpt-neo-125m
name: ${model_name_or_path}
kwargs:
model_max_length: ${max_seq_len}
# # if you are evaluating more than one model, list them all as YAML blocks without variable interpolation
# -
# model_name: mosaicml/mpt-7b
# model:
# name: hf_causal_lm
# pretrained_model_name_or_path: mosaicml/mpt-7b
# init_device: cpu
# pretrained: true
# tokenizer:
# name: mosaicml/mpt-7b
# kwargs:
# model_max_length: ${max_seq_len}

# load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4

20 changes: 10 additions & 10 deletions scripts/eval/yamls/tasks_light.yaml
@@ -1,40 +1,40 @@
icl_tasks:
-
label: lambada_openai
dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl
dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl # or use your own dataset URI
num_fewshot: [0]
icl_task_type: language_modeling
-
label: piqa
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/piqa.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
-
label: hellaswag
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/language_understanding/hellaswag.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
-
label: arc_easy
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_easy.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
continuation_delimiter: "\nAnswer: "
-
label: arc_challenge
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/world_knowledge/arc_challenge.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
continuation_delimiter: "\nAnswer: "
-
label: copa
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/commonsense_reasoning/copa.jsonl
num_fewshot: [0]
icl_task_type: multiple_choice
-
label: boolq
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl # ADD YOUR OWN DATASET URI
dataset_uri: eval/local_data/reading_comprehension/boolq.jsonl
num_fewshot: [10]
icl_task_type: multiple_choice
continuation_delimiter: "\nAnswer: " # this separates questions from answers
continuation_delimiter: "\nAnswer: "