feat(workflow): Implement a simplified CoAct workflow #3770
base: main
Conversation
just fyi, the integration tests seem to fail because of some … in some results.

Thanks, still waiting for reviews on it; if it is good to go I will look into the tests.
Force-pushed from 2890cc5 to fc44e17 (compare)
Hey, thanks a bunch for this @ryanhoangt!
I browsed through the code, and I think it's implemented quite well. Personally, I think the next step could be to test whether it gets good benchmark scores.

Thanks Prof., I'll do that and update how it goes soon.
It might be in the paper(s), but I don't quite like that the prompts now talk of

        'edit_file_by_replace',
        'insert_content_at_line',
        'append_file',
    ]
May be out of scope for this PR, but this list could maybe be defined as an agentskills_editing_methods list in the __init__ of the agentskills module, so you don't have to hard-code them here? Just as an idea.
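That suggestion could look roughly like the sketch below. The name agentskills_editing_methods and the helper function are illustrative, taken from the reviewer's idea rather than any existing OpenHands API:

```python
# Illustrative sketch of the reviewer's suggestion: define the editing-skill
# names once (e.g. in agentskills/__init__.py) instead of hard-coding the
# list inside each agent. All names here are hypothetical, not existing API.
agentskills_editing_methods = [
    'edit_file_by_replace',
    'insert_content_at_line',
    'append_file',
]

def is_editing_skill(name: str) -> bool:
    # Membership check against the single shared list, so agents that need
    # to filter out editing skills don't duplicate the names.
    return name in agentskills_editing_methods

print(is_editing_skill('append_file'))  # True
print(is_editing_skill('open_file'))    # False
```

The point is simply that the skill names then live next to the skills themselves, so adding or renaming one doesn't silently break the agent-side list.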
    {% set DEFAULT_EXAMPLE %}
    --- START OF EXAMPLE 1 ---

    USER: I want to go to the user 'Every-Hands-AI' github profile using browser and check how many repos they have, then write the number to the new 'repo_count.txt' file. Use a plan that has the least number of steps.
Every-Hands-AI? Is that a competitor product? 🤣
    @@ -70,7 +70,7 @@ def parse(self, action_str: str) -> Action:
            self.finish_command is not None
        ), 'self.finish_command should not be None when parse is called'
        thought = action_str.replace(self.finish_command.group(0), '').strip()
    -   return AgentFinishAction(thought=thought)
    +   return AgentFinishAction(thought=thought, outputs={'content': thought})
Sorry, I haven't looked at its implementation yet, but is the thought really required in two attributes, or is this only needed for the purposes of CoAct? 🤔
We need to pass it into the outputs argument, otherwise the delegator wouldn't know the result of the execution:

OpenHands/openhands/controller/agent_controller.py, lines 394 to 396 in 57390eb:

    formatted_output = ', '.join(
        f'{key}: {value}' for key, value in outputs.items()
    )
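To illustrate why the outputs dict matters here: the controller snippet above flattens it into a 'key: value' string for the delegating agent, so an empty dict leaves the delegator with an empty summary. A minimal, self-contained sketch of that formatting (mirroring the quoted lines, not the actual OpenHands code):

```python
def format_outputs(outputs: dict) -> str:
    # Mirrors the agent_controller snippet quoted above: flatten the finish
    # action's outputs dict into the string shown to the delegating agent.
    return ', '.join(f'{key}: {value}' for key, value in outputs.items())

thought = 'All plan phases completed successfully.'
# With outputs={} the delegator learns nothing about the result:
print(repr(format_outputs({})))              # ''
# Passing the thought through outputs surfaces it to the delegator:
print(format_outputs({'content': thought}))  # content: All plan phases completed successfully.
```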
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@li-boxuan - what do you think about this within the delegation flow?
We haven't had this before, so I'm wondering if this is only needed/wanted in "CoAct"-mode.
Great job! Sorry for all the nits; skip the ones that seem too picky, of course. 😄

Thanks so much for running the eval on this! From the outside, it is a bit surprising that it is not at least equal in performance, imo.

It's quite unexpected to me as well. I will upload the trajectory to the visualizer shortly.

@ketan1741 @tobitege I uploaded the trajectory to my viz here, as uploading a subset of eval on OpenHands's official space may confuse people. Maybe we can have a look to see if there are any obvious improvements to make. One thing to note is that because the eval process currently doesn't capture the trajectory for delegation, we can't see how the executor performed in the visualization. We can address this before running a new eval.

PR for that, untested at the moment: #3881
@ryanhoangt I read the CoAct paper today and found this open PR. Thanks for implementing the approach, as the idea is quite interesting and in theory should improve the overall performance. I'm also a bit surprised to see the lower performance, with the CoAct approach failing on instances where a standalone CodeActAgent resolves the instance.
Would it be possible to run eval on all 300 instances? I'd love to chat further on the All Hands Slack, as I'm exploring something similar and this idea has a lot of potential.

@ryanhoangt @ketan1741
@enyst thanks for the quick fix and the insightful comments. Re the …

Yeah, I can also imagine some other weird cases that might come up, and we will need more work to make the agent (with delegation) robustly usable on the platform. Tbh this is also not a version that I expected initially (i.e., using the grounding test case to help the agent resolve the issue) and am happy with, but it seems to be a good start in this direction. I'll try to look into other instances' trajectories and explore some other variations to see if they help. Should I close this PR for now?

Hi @jatinganhotra, regarding the full evaluation, I think it will likely further lower the score, as the remaining instances are not included in …

On All Hands's Slack we have a dedicated channel named "workflows" to discuss this topic. We'd love to chat more about improvement ideas there.
I also hope that understanding and debugging swe-bench issues will get easier after PR 3808 (simpler execution flow in the backend, fix delegates) and PR 3413 (logging prompts).
Sorry, that's not what I meant. Up to you, of course, but IMHO it's worth looking into it a bit more and fixing stuff around what happens. It is quite amazing that it solves 6 more than … Two minor thoughts:

One other thing that may help merge this PR is if we can keep the changes to the default agent to a minimum, or show that it doesn't affect its performance. Just my thought; not sure how others think about it?
🤔 I guess it could be that it's the default serialization, so if they're not added at that point in CodeAct, maybe it takes it by default? Not fully sure, but anyway I think it's visible when you look at the trajectories linked above. I'm looking now at the first of those two, and step 9 is like:
I attach a couple of files that are very strange to me. They're a few of the prompts sent to the LLM during django__django-11964. prompt_039.py is the first prompt to the Planner after the Executor returns with success. It has an observation using JSON. The last lines:

There's something else that looks suspicious to me just after this. The next prompt sent to the LLM is from the Executor, and its prompt includes some text from the Planner-specific prompt: prompt_040.py

Please correct me if I misunderstand how this should work. @ryanhoangt @ketan1741
Re the JSON in the visualizer, it seems to be because we don't format the finish action yet.

Good catch, this seems to be another bug. It might be because this action is not handled properly: return AgentFinishAction(thought=thought, outputs={'content': thought})
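One way "format the finish action" could look, as a rough sketch only: the event shape and function name below are illustrative, not the visualizer's real API.

```python
import json

# Hypothetical sketch: when rendering a trajectory, special-case the finish
# action so the viewer shows its content as plain text instead of dumping
# the raw serialized dict (the JSON observed in the visualizer above).
def render_event(event: dict) -> str:
    if event.get('action') == 'finish':
        outputs = event.get('args', {}).get('outputs', {})
        return outputs.get('content', '')  # plain text, not JSON
    return json.dumps(event)  # fallback: raw serialization for other events

evt = {'action': 'finish', 'args': {'outputs': {'content': 'Patch applied; tests pass.'}}}
print(render_event(evt))  # Patch applied; tests pass.
```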
Yeah, I also noticed this issue. My intention is to make the Planner include the full user message (hence the full problem statement in swe-bench) to give the executor some more context, but sometimes it included the message from the few-shot examples, or the "Now, let's come up with 2 global plans sequentially." as you saw, which is problematic.

"let's come up with 2 global plans sequentially" is an extra piece of prompt used only in swe-bench evaluation for CoActPlanner. Similar to OpenHands/evaluation/swe_bench/run_infer.py, lines 248 to 290 in 41ddba8.
I wonder if it's better to include the user message we want in the Executor ourselves, rather than nudging the LLM to include it. We know exactly the snippet we want, after all.

Yeah, that makes sense. I can try doing that in the next run.
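A sketch of that suggestion: build the delegate's task string from the known user message directly, instead of relying on the planner LLM to copy it into its output. The function and key names here are hypothetical, not the actual OpenHands delegation API:

```python
# Hypothetical sketch: since we know exactly which snippet the executor
# needs (the original user message / problem statement), prepend it to the
# delegated task in code rather than asking the planner LLM to echo it.
def build_delegate_inputs(user_message: str, planner_phase: str) -> dict:
    # The verbatim user request always comes first, so the executor sees the
    # original problem statement regardless of what the planner generated.
    return {'task': f'{user_message}\n\nCurrent phase from the planner: {planner_phase}'}

inputs = build_delegate_inputs(
    user_message='Fix the failing grounding test in django__django-11964.',
    planner_phase='Reproduce the failure, then patch the reported location.',
)
print(inputs['task'].splitlines()[0])  # Fix the failing grounding test in django__django-11964.
```

This would also rule out the failure mode discussed above, where the planner copied text from the few-shot examples or from its own swe-bench-specific instructions.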
Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG

CodeActAgent overlooks the buggy location. If we have a grounding test case for the issue, this workflow seems to help. CoActPlannerAgent finished but CodeActAgent failed (I expected both to be able to complete it):

Give a summary of what the PR does, explaining any non-trivial design decisions

Some next steps to improve this may be:
- … swe-bench-lite instances.
- … (BrowsingAgent or CoActExecutorAgent) to persist its history through the multi-turn conversation.

Link of any specific issues this addresses
#3077