
feat(workflow): Implement a simplified CoAct workflow #3770

Open · wants to merge 37 commits into base: main

Conversation

ryanhoangt
Contributor

@ryanhoangt ryanhoangt commented Sep 7, 2024

Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

  • This PR implements a simplified multi-agent workflow inspired by the CoAct paper.
  • Currently, in the swe-bench eval, there are complex instances that OpenHands fails on, especially ones where a single CodeActAgent overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help.
  • An overkill-ish successful trajectory with replanning can be found here.
  • A task that CoActPlannerAgent finished but CodeActAgent failed on (I expected both to be able to complete it):

Give a summary of what the PR does, explaining any non-trivial design decisions

  • Modify CodeActAgent so it can accept delegated tasks.
  • Implement two new agents, a planner and an executor, with the same abilities as CodeAct but different system prompts and additional action parsers (a rough sketch of the intended delegation flow follows this list).
  • Nit: adjust the delegate message shown in the UI.
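For orientation, a minimal sketch of the planner-to-executor hand-off, assuming the CodeAct-style parser interface; the class name, tag handling, and import path are simplified assumptions based on this PR's description and logs, not the exact implementation:

import re

from openhands.events.action import AgentDelegateAction  # import path is an assumption


class CoActPlannerDelegateParser:
    """Hypothetical parser: turns an <execute_global_plan> block emitted by the
    planner LLM into a delegation to the executor agent."""

    PLAN_TAG = re.compile(r'<execute_global_plan>(.*?)</execute_global_plan>', re.DOTALL)

    def check_condition(self, action_str: str) -> bool:
        # Only fires when the planner wraps a plan in the delegation tag.
        return self.PLAN_TAG.search(action_str) is not None

    def parse(self, action_str: str) -> AgentDelegateAction:
        plan = self.PLAN_TAG.search(action_str).group(1).strip()
        # Hand the written plan to the executor; the executor works through it
        # phase by phase and reports back via AgentFinishAction.outputs.
        return AgentDelegateAction(agent='CoActExecutorAgent', inputs={'task': plan})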

Some next steps to improve this may be:

  • Try eval on some swe-bench-lite instances.
  • Adjust the system/user prompts and few-shot examples to further specialize the two agents, and define the structure of the plan (e.g., its components) so the two agents can cooperate to finish a swe-bench issue.
  • Use a meta prompt to reinforce the agents' actions and make sure they follow the workflow.
  • Experiment with the ability of the global planner to refuse a replan request from the executor.
  • Implement the ability for the delegated agent (e.g., BrowsingAgent or CoActExecutorAgent) to persist its history across the multi-turn conversation.

Link of any specific issues this addresses

#3077

@tobitege
Collaborator

tobitege commented Sep 7, 2024

just fyi, the integration tests seem to fail because of some

"action_suffix": "browse"

in some results.

@ryanhoangt
Contributor Author

ryanhoangt commented Sep 7, 2024

just fyi, the integration tests seem to fail because of some

"action_suffix": "browse"

in some results.

Thanks, I'm still waiting for reviews on it; if it is good to go I will look into the tests.

@neubig neubig self-requested a review September 8, 2024 13:32
Contributor

@neubig neubig left a comment

Hey, thanks a bunch for this @ryanhoangt !

I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

@ryanhoangt
Contributor Author

Hey, thanks a bunch for this @ryanhoangt !

I browsed through the code, and I think it's implemented quite well. Personally I think the next step could be to test if it gets good benchmark scores.

Thanks Prof., I'll do and update how it goes soon.

@tobitege
Collaborator

It might be in the paper(s), but I don't quite like that the prompts now talk of "agent", while everywhere else it is "assistant". 🤔

'edit_file_by_replace',
'insert_content_at_line',
'append_file',
]
Collaborator

May be out of scope for this PR, but this list could maybe be defined as an agentskills_editing_methods list in the __init__ of the agentskills module, so you don't have to hard-code it here. Just as an idea.
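A rough sketch of that idea (module path and variable name are assumptions):

# e.g. in the agent_skills package __init__ (exact module path is an assumption)
agentskills_editing_methods = [
    'edit_file_by_replace',
    'insert_content_at_line',
    'append_file',
]

# The agent would then import the shared list instead of hard-coding it:
# from openhands.runtime.plugins.agent_skills import agentskills_editing_methods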

{% set DEFAULT_EXAMPLE %}
--- START OF EXAMPLE 1 ---

USER: I want to go to the user 'Every-Hands-AI' github profile using browser and check how many repos they have, then write the number to the new 'repo_count.txt' file. Use a plan that has the least number of steps.
Collaborator

Every-Hands-AI? Is that a competitor product? 🤣

@@ -70,7 +70,7 @@ def parse(self, action_str: str) -> Action:
             self.finish_command is not None
         ), 'self.finish_command should not be None when parse is called'
         thought = action_str.replace(self.finish_command.group(0), '').strip()
-        return AgentFinishAction(thought=thought)
+        return AgentFinishAction(thought=thought, outputs={'content': thought})
Collaborator

Sorry, I haven't looked at its implementation yet, but is the thought really required in 2 attributes, or is this only needed for the purposes of CoAct? 🤔

Contributor Author

We need to pass it into the outputs argument, otherwise the delegator wouldn't know the result of the execution:

formatted_output = ', '.join(
    f'{key}: {value}' for key, value in outputs.items()
)
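For example, with the outputs set in the finish action above, the delegator's observation would render roughly like this (values are illustrative):

outputs = {'content': 'All three phases are complete and the new test passes.'}
formatted_output = ', '.join(
    f'{key}: {value}' for key, value in outputs.items()
)
# formatted_output == 'content: All three phases are complete and the new test passes.'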

Collaborator

@li-boxuan - what do you think about this within the delegation flow?
We haven't had this before, so I'm wondering if this is only needed/wanted in "CoAct"-mode.

@tobitege
Collaborator

Great job! Sorry for all the nits; skip the ones that seem too picky, of course. 😄
As an extra thought: the two system prompts' examples look pretty much the same. If those could be a partial template or applied via code, it could reduce the duplication a little, maybe as a follow-up PR?

@ryanhoangt
Contributor Author

ryanhoangt commented Sep 15, 2024

After running the eval on a subset of 93 instances, it's quite disappointing to see that the performance for now is pretty bad 😢. CoAct only resolved 25/93 while CodeAct resolved 39/93.

Below are some plots of both agents' results:

Another thing I noticed is that the lists of medium issues that each resolved are pretty different:

I'm not sure if we should add this agent into agenthub and improve it in future PRs?

@tobitege
Collaborator

Not sure if we can add this agent into agenthub and improve it in future PRs?

Thanks so much for running the eval on this! From the outside, it is a bit surprising that it is not at least equal in performance, imo.
If you could share the trajectory with us (if not already done), e.g. in Slack, we should do some investigation into the failed instances.
Sometimes it's just one small thing being responsible for multiple fails.

@ryanhoangt
Contributor Author

ryanhoangt commented Sep 15, 2024

From the outside, it is a bit surprising that it is not at least equal in performance, imo.

It's quite unexpected to me as well. I will upload the trajectory to the visualizer shortly.

@ryanhoangt
Contributor Author

ryanhoangt commented Sep 15, 2024

@ketan1741 @tobitege I uploaded the trajectory to my visualizer here, as uploading a subset of the eval to OpenHands's official space may confuse people. Maybe we can have a look to see if there are any obvious improvements to make.

One thing to note: because the eval process currently doesn't capture the trajectory for delegation, we can't see how the executor performed in the visualization. We can address this before running a new eval.

@enyst
Collaborator

enyst commented Sep 16, 2024

One thing to note: because the eval process currently doesn't capture the trajectory for delegation, we can't see how the executor performed in the visualization. We can address this before running a new eval.

PR for that, untested at the moment: #3881

@jatinganhotra

@ryanhoangt I read the CoAct paper today and found this open PR. Thanks for implementing the approach; the idea is quite interesting and in theory should improve the overall performance.

I'm also a bit surprised to see the lower performance and CoAct approach failing on instances where a standalone CodeActAgent resolves the instance.

I'm not sure if we should add this agent into agenthub and improve it in future PRs?

Would it be possible to run eval on all 300 instances? I'd love to chat further on AllHands Slack as I'm exploring something similar and this idea has a lot of potential.

@enyst
Collaborator

enyst commented Sep 16, 2024

@ryanhoangt @ketan1741
I'm debugging this a bit, and I don't yet have enough of the answers we need, but just so you know:

  • The PR is merged. Please see here for delegate trajectories on a simple test: HuggingFace, or the attached CoActPlannerAgent.zip

  • IMHO django__django-10914 looks like a fluke. It's funny: the agent has to find FILE_UPLOAD_PERMISSIONS; in the task, FILE_UPLOAD_PERMISSION is used interchangeably with FILE_UPLOAD_PERMISSIONS, and the agent read the right file, but 1) on main it concluded that it has FILE_UPLOAD_PERMISSIONS and modifies it, 2) on the branch it concluded it cannot find FILE_UPLOAD_PERMISSION, so it has to add it, and it fails the task.
    In the example test above it succeeded (repeatedly) on the PR branch. 🤷

  • In part, we are making CodeAct a delegate and it's not yet ready for it. 😅 For example, one thing that happens is that it uses get_user_message() here, which has to return a MessageAction. But a delegate doesn't have any MessageAction, and the method has no protection for delegates (it doesn't stop at start_id or AgentDelegateAction), so it hijacks a MessageAction from the parent agent. This shouldn't happen (fix coming up; a rough sketch of such a guard follows this list).
    It doesn't influence anything else, but there might be other issues of this kind...?
    FWIW other delegates use something like this.
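A rough illustration of the kind of guard described above, assuming a simplified list of history events and that the action classes are importable as shown (import path is an assumption); this is a sketch, not the actual OpenHands fix:

from openhands.events.action import AgentDelegateAction, MessageAction  # path is an assumption


def get_last_user_message(events, start_id):
    """Return the latest user MessageAction that belongs to this (possibly
    delegated) agent, instead of hijacking one from the parent agent."""
    for event in reversed(events):
        if getattr(event, 'id', start_id) < start_id:
            # Crossed into the parent agent's history: stop looking.
            return None
        if isinstance(event, AgentDelegateAction):
            # Delegation boundary reached without finding a user message.
            return None
        if isinstance(event, MessageAction) and event.source == 'user':
            return event
    return None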

@ryanhoangt
Contributor Author

ryanhoangt commented Sep 17, 2024

@enyst thanks for the quick fix and the insightful comments.

Re the django__django-10914 instance, it's interesting to know there are some unclear specs here; I don't know which behavior I should advocate 😅. But I think there's not much we can do; it's more to do with the LLM than the agent implementation.

we are making CodeAct a delegate and it's not yet ready for it.

Yeah I can also imagine some other weird cases that might come up, and we will need more work to make the agent (with delegation) robustly usable on the platform.

Tbh this is also not a version that I expected initially (i.e., using the grounding test case to help the agent resolve the issue) and am happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectory and explore some other variations to see if it helps. Should I close this PR for now?


Hi @jatinganhotra, regarding the full evaluation, I think it will likely lower the score further, as the remaining instances are not included in swe-bench-verified. IMO it would be more beneficial to run the full eval after validating that the agent gets good scores on this "verified" subset.

On All Hands's Slack we have a dedicated channel named "workflows" to discuss this topic. We'd love to chat more about improvement ideas there.

@enyst
Collaborator

enyst commented Sep 17, 2024

Tbh this is also not a version that I expected initially and am happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectory and explore some other variations to see if it helps.

I also hope that understanding and debugging swe-bench issues will get easier after PR 3808 (simpler execution flow in the backend, fix delegates) and PR 3413 (logging prompts).

Should I close this PR for now?

Sorry, that's not what I meant. Up to you, of course, but IMHO it's worth looking into it a bit more and fixing the things around what happens. It is quite amazing that it solves 6 more than main, but for some reason we still don't know, it breaks on 9. FWIW the 2 tests I worked with were from those 9 that pass on main but failed on the branch, and they passed every time.

Two minor thoughts:

  • the PR sends the new actions as serialized in json, but CodeAct doesn't usually do that; it sends strings, as in a dialogue with the LLM. It may be less confusing to the LLM if we are consistent and add the new ones around here in a similar format, extracting the useful info (see the sketch after this list). Small note: Claude doesn't do as well with JSON as GPT-4o in my experience, not because it can't, but it seems to need some prompting or gets confused at times.
  • what do you think, what if we give Claude the prompt pieces, and ask it to rephrase for clarity, precision, and all info needed?
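A minimal sketch of the first point; the phase field names follow the plan structure visible in the logs below, but the function itself is illustrative:

def global_plan_to_prompt_string(phases):
    """Render a planner-issued plan as plain dialogue text for the LLM,
    extracting the useful fields instead of dumping serialized JSON."""
    lines = []
    for i, phase in enumerate(phases, start=1):
        lines.append(
            f"Phase {i}: {phase['description']} "
            f"(reason: {phase['reason']}; expected: {phase['expected_state']})"
        )
    return 'I am delegating the following plan to the executor:\n' + '\n'.join(lines)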

One other thing that may help merging this PR is if we can keep to a minimum the changes to the default agent or show that it doesn't affect its performance. Just my thought, not sure how others think about it?

@ryanhoangt
Contributor Author

ryanhoangt commented Sep 17, 2024

Sounds good, I can do the debugging in the next few days and try running a new eval to obtain the trajectory of the executor. In the previous eval, bugs such as the missing </execute...> tag in the stop sequences also negatively affected the performance, as they made the planner execute the plan itself. Also happy to hear other ideas on improving this!
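On the stop-sequence bug, a minimal illustration; only the tags that appear elsewhere in this thread are certain, and the exact wiring into the LLM call is an assumption:

# Closing tags after which generation should stop, so the planner hands the plan
# off to the executor instead of continuing to "execute" it itself.
PLANNER_STOP_SEQUENCES = [
    '</execute_bash>',         # used by CodeAct-style prompts
    '</execute_global_plan>',  # the planner's delegation tag
    # ...plus any other </execute_*> closing tags the agent can emit
]
# These would be passed as the stop list of the LLM completion call.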


Also just to clarify a bit, the 6/9 diff above is for instances at the medium (15m-1h) level. There's also a diagram for instances at the easy (<15m) level (the 2 instances you ran are in the red subset):
Venn Diagram of Resolved Easy Instance IDs

I looked into the trajectories at this easy level; one observation that made CoAct fail on many instances is that the planner very often came up with plans that had more steps than needed (often the plan contained 3 phases but only 1 of them was actually required to resolve the issue).


Some questions:

the PR sends the new actions as serialized in json

Can you clarify this? I'm not sure I did any json serialization in the PR...

One other thing that may help merging this PR is if we can keep to a minimum the changes to the default agent

I think there are no changes to CodeAct execution in this PR, except the message property of AgentDelegateAction (which CodeAct should not use in swe-bench). Can you point out the changes that concern you?

what do you think, what if we give Claude the prompt pieces, and ask it to rephrase for clarity, precision, and all info needed?

Can you give an example of this?

@enyst
Collaborator

enyst commented Sep 17, 2024

Can you clarify this? I'm not sure I did any json serialization in the PR...

🤔 I guess it could be that it's the default serialization, so if they're not added at that point in CodeAct, maybe it falls back to the default? Not fully sure, but anyway I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like:

Action
{
  "id": 20,
  "timestamp": "2024-09-16T19:46:29.172006",
  "source": "agent",
  "message": "The new test method has been added successfully. Now all three phases of the task are complete. Let's summarize what we've done:

1. Modified the FILE_UPLOAD_PERMISSIONS setting in global_settings.py to 0o644.
....

@enyst
Collaborator

enyst commented Sep 17, 2024

I attach a couple of files that are very strange to me. They're a few of the prompts sent to the LLM during django__django-11964.
prompts.zip

prompt_039.py is the first prompt to Planner after the Executor returns with success. It has an observation using JSON. The last lines:

...
</expected_content>
</execute_global_plan>

----------

OBSERVATION:
{'content': "Thank you for the update. The changes for all three phases have been successfully implemented. Let's review the modifications:\n\n1. We added the `__str__` method to the `TextChoices` class.\n2. We added the `__str__` method to the `IntegerChoices` class.\n3. We added the `__str__` method to the base `Choices` class.\n\nThese changes should address the issue of differing types for TextChoices and IntegerChoices fields. The `__str__` method in each class now ensures that the string representation of the enum value is returned.\n\nTo summarize the changes made:\n\n1. In the `Choices` base class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\n2. In the `IntegerChoices` class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\n3. In the `TextChoices` class:\n   ```python\n   def __str__(self):\n       return str(self.value)\n   ```\n\nThese modifications should resolve the issue of differing types for TextChoices and IntegerChoices fields. The value returned by the getter of the field will now be of the same type as the one created by enum.Enum (enum value).\n\nThe task is now complete."}


ENVIRONMENT REMINDER: You have 21 turns left to complete the task. When finished reply with <finish></finish>.

There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt:

prompt_040.py

You are an autonomous intelligent programming agent playing the role of a subordinate employee...
...
NOW, LET'S START!
The user message is: "Now, let's come up with 2 global plans sequentially.
- First, examine the codebase and locate the relevant code for the issue. Then we'll come up with the FIRST detailed plan with all the edits to resolve it.
- After the local agent finishes executing the first plan, navigate the codebase again and come up with the SECOND detailed plan to create exactly ONE unit test at the correct location to verify the change has actually resolved the issue. As the LAST phase, explicitly tell the executor to execute it after creating that test case. If the test failed and after debugging it the local executor believes the previous fixes are incorrect, request for a new plan and include the error with explanation for that request."

# Phases
## Phase 1
- description: Create a new test file for enum-related tests.
 ...
## Phase 3
- description: Execute the newly created test case.
- reason: We need to verify that our changes have resolved the issue and that the test passes.
- expected_state: The test case is executed and passes successfully.


ENVIRONMENT REMINDER: You have 20 turns left to complete the task. When finished reply with <finish></finish>.

Please correct me if I misunderstand how this should work. @ryanhoangt @ketan1741
I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?

@ryanhoangt
Contributor Author

ryanhoangt commented Sep 18, 2024

I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like:

Re the JSON in the visualizer, it seems like this is because we don't format the finish action yet.

prompt_039.log - It has an observation using JSON.

Good catch, this seems to be another bug. Might be because this action is not handled properly:

return AgentFinishAction(thought=thought, outputs={'content': thought})
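One possible direction (a sketch only, not necessarily how the final fix should look) is to format the delegate's finish outputs into plain text before they reach the parent, instead of passing the raw dict through:

def format_delegate_result(outputs):
    """Turn the delegate's AgentFinishAction.outputs into readable text for the
    parent agent, so the planner sees prose instead of a serialized dict."""
    if not outputs:
        return 'The delegate finished without reporting any output.'
    if set(outputs) == {'content'}:
        # A single 'content' key: show its text directly.
        return str(outputs['content'])
    return ', '.join(f'{key}: {value}' for key, value in outputs.items())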

There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt

Yeah, I also noticed this issue. My intention was to make the Planner include the full user message (hence the full problem statement in swe-bench) to give the executor some more context, but sometimes it included the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." text as you saw, which is problematic.

I thought this section about "let's come up with 2 global plans sequentially" is part of the Planner agent prompt, and "playing the role of a subordinate employee" is the Executor. (Then the phases are written by the Planner for the Executor.) Isn't that the case? Does the above look expected?

"let's come up with 2 global plans sequentially" - this is an extra piece of prompt used only in swe-bench evaluation for CoActPlanner. Similar to CodeActSWEAgent below, it can be used to steer the agent a bit to be better at a specific task, but I'm not sure the current "2 global plans" is the optimal way to go. In CodeActAgent there're many cases where the agent just fixed the issues without creating any tests.

if agent_class == 'CodeActSWEAgent':
    instruction = (
        'We are currently solving the following issue within our repository. Here is the issue text:\n'
        '--- BEGIN ISSUE ---\n'
        f'{instance.problem_statement}\n'
        '--- END ISSUE ---\n\n'
    )
    if USE_HINT_TEXT and instance.hints_text:
        instruction += (
            f'--- BEGIN HINTS ---\n{instance.hints_text}\n--- END HINTS ---\n'
        )
    instruction += f"""Now, you're going to solve this issue on your own. Your terminal session has started and you're in the repository's root directory. You can use any bash commands or the special interface to help you. Edit all the files you need to and run any checks or tests that you want.
Remember, YOU CAN ONLY ENTER ONE COMMAND AT A TIME. You should always wait for feedback after every command.
When you're satisfied with all of the changes you've made, you can run the following command: <execute_bash> exit </execute_bash>.
Note however that you cannot use any interactive session commands (e.g. vim) in this environment, but you can write scripts and run them. E.g. you can write a python script and then run it with `python <script_name>.py`.
NOTE ABOUT THE EDIT COMMAND: Indentation really matters! When editing a file, make sure to insert appropriate indentation before each line!
IMPORTANT TIPS:
1. Always start by trying to replicate the bug that the issues discusses.
If the issue includes code for reproducing the bug, we recommend that you re-implement that in your environment, and run it to make sure you can reproduce the bug.
Then start trying to fix it.
When you think you've fixed the bug, re-run the bug reproduction script to make sure that the bug has indeed been fixed.
If the bug reproduction script does not print anything when it successfully runs, we recommend adding a print("Script completed successfully, no errors.") command at the end of the file,
so that you can be sure that the script indeed ran fine all the way through.
2. If you run a command and it doesn't work, try running a different command. A command that did not work once will not work the second time unless you modify it!
3. If you open a file and need to get to an area around a specific line that is not in the first 100 lines, say line 583, don't just use the scroll_down command multiple times. Instead, use the goto 583 command. It's much quicker.
4. If the bug reproduction script requires inputting/reading a specific file, such as buggy-input.png, and you'd like to understand how to input that file, conduct a search in the existing repo code, to see whether someone else has already done that. Do this by running the command: find_file("buggy-input.png") If that doesn't work, use the linux 'find' command.
5. Always make sure to look at the currently open file and the current working directory (which appears right after the currently open file). The currently open file might be in a different directory than the working directory! Note that some commands, such as 'create', open files, so they might change the current open file.
6. When editing files, it is easy to accidentally specify a wrong line number or to write code with incorrect indentation. Always check the code after you issue an edit to make sure that it reflects what you wanted to accomplish. If it didn't, issue another command to fix it.
[Current directory: /workspace/{workspace_dir_name}]
"""
else:
    # Testing general agents
    instruction = (

@enyst
Collaborator

enyst commented Sep 18, 2024

I wonder if it's better if we include the user message we want in the Executor ourselves, rather than nudge the LLM to include it. We know exactly the snippet we want, after all.
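A minimal sketch of that idea (the inputs keys and import path are assumptions): attach the original request by code when the delegation action is built, so the planner no longer has to copy it into the plan text.

from openhands.events.action import AgentDelegateAction  # import path is an assumption


def delegate_with_user_context(plan_text, original_request):
    """Hypothetical helper: inject the original user message into the Executor's
    inputs directly instead of nudging the planner LLM to include it."""
    return AgentDelegateAction(
        agent='CoActExecutorAgent',
        inputs={
            'task': plan_text,                 # the phases written by the planner
            'user_message': original_request,  # injected verbatim by code
        },
    )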

@ryanhoangt
Contributor Author

Yeah, that makes sense; I can try doing that in the next run.

Labels: agent framework (Strategies for prompting, agent, etc), enhancement (New feature or request)
Projects: None yet
6 participants