
feat: reconstruct evaluation framework #33

Merged
merged 5 commits into PKU-Alignment:main on Aug 6, 2024

Conversation

@Kass123777 (Collaborator) commented Aug 4, 2024

Description

  • Reconstruct evaluation framework.
  • Support vLLM and DeepSpeed as generation backends.
  • Support 12 text -> text benchmarks and 1 text + image -> text benchmark.

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #1314520 if this solves the issue #15213

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@htlou (Contributor) left a comment

LGTM despite some undeleted absolute paths

Contributor

Are these JSON files meant to be merged with the PR, or should they be downloaded separately from Hugging Face?

Contributor

Solved.

@Gaiejj changed the title from "feat: reconstruct evaluation framework." to "feat: reconstruct evaluation framework" on Aug 4, 2024
@@ -0,0 +1,49 @@
#!/usr/bin/env bash
Member

Remove this line from all YAML files.

    default=False,
    help="If True, chain-of-thought will be implemented during generation",
)
parser.add_argument("--batch_size", type=str, default=1)
Member

add 'help' here
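
For illustration, a minimal sketch of the requested change; the help wording and the switch to type=int are assumptions (the original declares type=str):

parser.add_argument(
    "--batch_size",
    type=int,
    default=1,
    help="Number of prompts sent to the backend per generation batch",
)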

@Gaiejj (Member) left a comment

Please check the above comments.

@htlou (Contributor) left a comment

LGTM in this way.

@XuyaoWang (Collaborator) left a comment

LGTM.

exit()

for k, v in unparsed_args.items():
    dict_configs = update_dict(dict_configs, custom_cfgs_to_dict(k, v))
@cby-pku (Contributor) commented Aug 5, 2024

Suggested change
-    dict_configs = update_dict(dict_configs, custom_cfgs_to_dict(k, v))
+    if v == '' or v is None:
+        continue
+    dict_configs = update_dict(dict_configs, custom_cfgs_to_dict(k, v))

@Kass123777
You should add this guard to each vllm_eval.py under ./evaluation/benchmarks.
The problem is as follows:

  1. You define some input variables in eval.sh, such as output-dir. With the current logic, if we pass a value to a hyper-parameter in the shell script, it overrides the corresponding value in the YAML file.
  2. But if we don't pass a value to a hyper-parameter in the shell script, the related value is set to None, and that None overrides the corresponding YAML value, leading to output path errors.
  3. So we should add an if statement on this line: if the value is None, keep the value from the YAML file (see the sketch after this comment).
  4. If you compare your evaluation files with the training scripts, you will find that the training scripts don't define these variables, so this issue does not occur there.

Great work! Looking forward to your reply!
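
For illustration, a self-contained toy version of the guard described above; the helper bodies and example values are stand-ins, the real update_dict and custom_cfgs_to_dict live in the repository:

# Toy stand-ins for the repository helpers, only to make the example runnable.
def custom_cfgs_to_dict(key, value):
    return {key: value}

def update_dict(base, override):
    merged = dict(base)
    merged.update(override)
    return merged

dict_configs = {'output_dir': './output', 'batch_size': 1}   # values from the YAML file
unparsed_args = {'output_dir': None, 'batch_size': '8'}      # CLI overrides; unset flags arrive as None

for k, v in unparsed_args.items():
    if v == '' or v is None:
        continue  # skip empty overrides so the YAML defaults survive
    dict_configs = update_dict(dict_configs, custom_cfgs_to_dict(k, v))

print(dict_configs)  # {'output_dir': './output', 'batch_size': '8'}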

Contributor

Thanks for the correction! With that change, the program is more robust.

@cby-pku (Contributor) left a comment

Please fix the related minor bugs.

def evaluator(raw_output1: List[InferenceOutput], raw_output2: List[InferenceOutput], dataloader: MTBenchDataLoader, task: str, eval_configs=None):
    current_file_path = os.path.abspath(__file__)
    current_dir = os.path.dirname(current_file_path)
    dataset = load_dataset(task, data_files=os.path.join(current_dir, eval_configs.task_dir))[dataloader.split]
Contributor

Suggested change
-    dataset = load_dataset(task, data_files=os.path.join(current_dir, eval_configs.task_dir))[dataloader.split]
+    dataset = load_dataset(current_dir, task)[dataloader.split]
+    dataset = load_dataset(current_dir, split='train', data_files='test.jsonl')

The first line loads the dataset from the current directory and uses the value of dataloader.split to pick the data file: if your test prompt file is named test.jsonl it will succeed, but if it is named test_prompt.jsonl it will fail.

The second line is a more robust way to load the dataset.

Both of the suggested lines are correct; the original line leads to an ERROR.
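
For reference, a self-contained sketch of the more robust loading pattern, assuming the prompt file is named test.jsonl and sits next to the evaluator script; it uses the explicit 'json' builder, which behaves like passing the directory path for a local JSONL file:

import os
from datasets import load_dataset

current_dir = os.path.dirname(os.path.abspath(__file__))

# Name the data file explicitly so loading does not depend on the file name
# matching the split (e.g. test.jsonl vs. test_prompt.jsonl).
dataset = load_dataset(
    'json',
    data_files=os.path.join(current_dir, 'test.jsonl'),
    split='train',
)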

Contributor

It is nice of you to provide such detailed feedback on the code! I really appreciate your insights and suggestions.

merged_dict = {**resp, **eval_}
merged_list.append(merged_dict)

raw_result_file = eval_configs.output_dir+file_name+"_raw_result.jsonl"
Contributor

Suggested change
-    raw_result_file = eval_configs.output_dir+file_name+"_raw_result.jsonl"
+    os.makedirs(eval_configs.output_dir, exist_ok=True)
+    raw_result_file = os.path.join(eval_configs.output_dir, file_name + "_raw_result.jsonl")

The original builds a bad output file path: it concatenates strings directly and never creates the output directory (a small illustration follows below).
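
A tiny example of the failure mode, with made-up values:

import os

output_dir, file_name = './outputs', 'mmlu'   # made-up example values

# Plain concatenation silently yields a wrong path when output_dir has no
# trailing slash, and writing fails if the directory does not exist yet.
print(output_dir + file_name + '_raw_result.jsonl')               # ./outputsmmlu_raw_result.jsonl

os.makedirs(output_dir, exist_ok=True)
print(os.path.join(output_dir, file_name + '_raw_result.jsonl'))  # ./outputs/mmlu_raw_result.jsonl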

Contributor

Thanks for your insights and suggestions!

dataloader = MMLUDataLoader(dict_configs)
test_data = dataloader.load_dataset()
eval_module = MMLUGeneratorVLLM(model_config, infer_configs)
raw_outputs = eval_module.eval(test_data, eval_configs)
Contributor

Please check the code: evaluating MMLU with vLLM is slow. See #35.

Collaborator (Author)

Thanks for your question! I have replied in #35.

self.llm_trust_remote_code = self.vllm_cfgs_llm.trust_remote_code
self.llm_gpu_memory_utilization = self.vllm_cfgs_llm.gpu_memory_utilization
self.llm_max_num_seqs = self.vllm_cfgs_llm.max_num_seqs
self.llm_tensor_parallel_size = cuda_device_count_stateless()
@cby-pku (Contributor) commented Aug 5, 2024

Try to support command-line hyper-parameters; you can also change the config file vllm_basic.json.

I tried to implement it with the following change, but it raises an error:

Suggested change
-    self.llm_tensor_parallel_size = cuda_device_count_stateless()
+    tensor_ps = self.vllm_cfgs_llm.tensor_parallel_size
+    self.llm_tensor_parallel_size = tensor_ps if tensor_ps else cuda_device_count_stateless()
AttributeError: 'RayGPUExecutor' object has no attribute 'forward_dag'
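
For illustration only, a sketch of the intended fallback; resolve_tensor_parallel_size is a hypothetical helper, and torch.cuda.device_count() stands in for the cuda_device_count_stateless() used in the snippet:

import torch

def resolve_tensor_parallel_size(configured_size=None):
    # Hypothetical helper: prefer the value from the YAML/CLI config,
    # otherwise fall back to using every visible GPU.
    available_gpus = max(torch.cuda.device_count(), 1)
    if not configured_size:
        return available_gpus
    # Never request more tensor-parallel ranks than visible GPUs.
    return min(int(configured_size), available_gpus)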

Collaborator (Author)

This problem has the same cause as #35 and will be fixed in the next version.

Contributor

Great! I suggest adding some experiments with Vicuna-33B; if it works, then I think it is OK.

@cby-pku (Contributor) left a comment

Great Work!

@cby-pku (Contributor) left a comment

Add related descriptions of the methods and update the news in the global README.md.

@cby-pku merged commit e830052 into PKU-Alignment:main on Aug 6, 2024