[WIP] feat: enable swe_bench for API-hosted models #79
base: main
| inspect-evals = { git = "https://github.com/UKGovernmentBEIS/inspect_evals" } | ||
| aviato = { path = "../../../aviato-client", editable = true } | ||
| inspect-aviato-sandbox = { path = "../../../inspect_aviato_sandbox", editable = true } | ||
| inspect-evals = { path = "../../../inspect_evals", editable = true } |
Requires this branch (and fork) https://github.com/iiilisan/inspect_evals/tree/swebench/epoch-refactor
| [tool.uv.sources] | ||
| inspect-evals = { git = "https://github.com/UKGovernmentBEIS/inspect_evals" } | ||
| aviato = { path = "../../../aviato-client", editable = true } | ||
| inspect-aviato-sandbox = { path = "../../../inspect_aviato_sandbox", editable = true } |
Requires the chunked read/writes from https://github.com/coreweave/inspect_aviato_sandbox/tree/npun-read-write-chunks, until https://github.com/coreweave/aviato-client/issues/31 is resolved
| # sandbox_type="docker", | ||
| sandbox_config=create_aviato_sandbox_spec_with_env, | ||
| arch="x86_64", | ||
| solver=swe_bench_react_agent(), |
We'll need to set the solver here until https://github.com/coreweave/aviato-client/issues/32 is resolved.
swe_bench_agent_with_inspect_tool_support is the current solver in inspect_evals (see https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/src/inspect_evals/swe_bench/swe_bench_tasks.py#L286-L305) and requires bidirectional communication to enable the bash_sessions and text_editor tools.
swe_bench_react_agent fires one-off bash commands, which conforms to our current exec paradigm.
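To illustrate why the distinction matters, here is a minimal sketch of a stateless, one-off exec. It uses a local `subprocess` call as a stand-in and is not the aviato client API; `exec_once` and `DEMO_STATE` are hypothetical names. Each call starts a fresh shell, so state set in one call is gone in the next, which is what session-based tools would need to avoid.

```python
import subprocess


def exec_once(command: str) -> str:
    """Run one command in a fresh shell and return its stdout.

    Hypothetical stand-in for a one-off exec; not the aviato client API.
    """
    result = subprocess.run(
        ["sh", "-c", command],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Two separate calls: the variable set in the first shell does not survive.
exec_once("DEMO_STATE=1")
separate = exec_once('echo "${DEMO_STATE:-unset}"')

# One call: state is visible because both steps share a single shell.
combined = exec_once('DEMO_STATE=1; echo "${DEMO_STATE:-unset}"')
```

Under this model an agent has to chain dependent steps into a single command, which is presumably fine for one-off bash calls but not for tools that expect a long-lived session.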
| instance_ids=[ | ||
| "astropy__astropy-12907", | ||
| # "astropy__astropy-14182", | ||
| # "astropy__astropy-14365", | ||
| # "astropy__astropy-14995", | ||
| # "astropy__astropy-6938", | ||
| # "astropy__astropy-7746", | ||
| # "django__django-10914", | ||
| # "django__django-10924", | ||
| # "django__django-11001", | ||
| # "django__django-11019", | ||
| ], |
Hardcoded for testing. I wonder if we'll want to allow users to select which instance_ids they want to run?
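One possible shape for that, as a sketch only: the helper name `resolve_instance_ids` and the comma-separated input format are assumptions, not part of this PR or of inspect_evals. It falls back to the ID currently hardcoded for testing when nothing usable is passed.

```python
from typing import Optional

# The single instance left uncommented in this PR's test configuration.
DEFAULT_INSTANCE_IDS = ["astropy__astropy-12907"]


def resolve_instance_ids(raw: Optional[str]) -> list[str]:
    """Parse a comma-separated string of SWE-bench instance IDs.

    Hypothetical helper: trims whitespace, drops empty entries and
    duplicates (keeping first-seen order), and falls back to the
    default list when no IDs are given.
    """
    if raw is None:
        return list(DEFAULT_INSTANCE_IDS)

    seen: set[str] = set()
    ids: list[str] = []
    for part in raw.split(","):
        instance_id = part.strip()
        if instance_id and instance_id not in seen:
            seen.add(instance_id)
            ids.append(instance_id)

    return ids or list(DEFAULT_INSTANCE_IDS)
```

The returned list could then be passed as `instance_ids=` in place of the hardcoded list, with the raw string coming from a task argument or an environment variable.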
This can be run locally with the following requirements:
- aviato-client
- inspect_aviato_sandbox and switch to npun-read-write-chunks
- inspect_evals and switch to swebench/epoch-refactor

Then:
- jobs/inspect_ai_evals
- pyproject.toml
- coreweave_ml org in order to access the aviato sandboxes