nicholaspun-wandb commented Jan 27, 2026

This can be run locally with the following requirements:

Then:

nicholaspun-wandb self-assigned this Jan 27, 2026
nicholaspun-wandb changed the title from "feat: enable swe_bench for API-hosted models" to "[WIP] feat: enable swe_bench for API-hosted models" Jan 27, 2026
nicholaspun-wandb marked this pull request as draft January 27, 2026 22:02
inspect-evals = { git = "https://github.com/UKGovernmentBEIS/inspect_evals" }
aviato = { path = "../../../aviato-client", editable = true }
inspect-aviato-sandbox = { path = "../../../inspect_aviato_sandbox", editable = true }
inspect-evals = { path = "../../../inspect_evals", editable = true }
Contributor Author

[tool.uv.sources]
inspect-evals = { git = "https://github.com/UKGovernmentBEIS/inspect_evals" }
aviato = { path = "../../../aviato-client", editable = true }
inspect-aviato-sandbox = { path = "../../../inspect_aviato_sandbox", editable = true }
Contributor Author

# sandbox_type="docker",
sandbox_config=create_aviato_sandbox_spec_with_env,
arch="x86_64",
solver=swe_bench_react_agent(),
Contributor Author

We'll need to set the solver here until https://github.com/coreweave/aviato-client/issues/32 is resolved.

swe_bench_agent_with_inspect_tool_support is the current solver in inspect_evals (see https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/src/inspect_evals/swe_bench/swe_bench_tasks.py#L286-L305). It requires bidirectional communication to enable the bash_session and text_editor tools.

swe_bench_react_agent instead fires one-off bash commands, which conforms to our current exec paradigm.
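
To illustrate why the distinction matters, here is a minimal sketch of a one-off exec model using only the Python standard library. It is not the aviato-client or inspect_ai API: `exec_once`, `demonstrate_statelessness`, and the `DEMO_STATE` variable are hypothetical names, and the sketch assumes `bash` is on the PATH. Each command runs in a fresh process, so shell state such as the working directory or exported variables does not carry over, which is the state a session-based tool would keep alive.

```python
import subprocess
import tempfile


def exec_once(command: str, cwd: str) -> str:
    """Run a single bash command in a fresh process and return its stdout."""
    result = subprocess.run(
        ["bash", "-c", command],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demonstrate_statelessness() -> dict:
    """Show that `cd` and `export` do not persist between one-off commands."""
    with tempfile.TemporaryDirectory() as workdir:
        exec_once("mkdir -p sub", cwd=workdir)
        # This changes directory and exports a variable, but only inside
        # the short-lived process that ran the command.
        exec_once("cd sub && export DEMO_STATE=set", cwd=workdir)
        return {
            # A later command starts again from the original directory.
            "pwd_after_cd": exec_once("pwd", cwd=workdir),
            # The exported variable is gone as well.
            "state": exec_once('echo "${DEMO_STATE:-unset}"', cwd=workdir),
            # State has to be chained explicitly within one command.
            "pwd_chained": exec_once("cd sub && pwd", cwd=workdir),
        }


if __name__ == "__main__":
    print(demonstrate_statelessness())
```

An agent built on one-off commands has to chain any state it needs into each command, which fits a request/response exec call. Tools that keep a long-lived shell or editor open need a persistent channel in both directions.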

Comment on lines +160 to +171
instance_ids=[
"astropy__astropy-12907",
# "astropy__astropy-14182",
# "astropy__astropy-14365",
# "astropy__astropy-14995",
# "astropy__astropy-6938",
# "astropy__astropy-7746",
# "django__django-10914",
# "django__django-10924",
# "django__django-11001",
# "django__django-11019",
],
Contributor Author

Hardcoded for testing. I wonder if we'll want to let users select which instance_ids to run?
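
One possible shape for that, sketched under assumptions: the helper `resolve_instance_ids` and the environment variable `SWE_BENCH_INSTANCE_IDS` are hypothetical and not part of this PR or of inspect_evals. The idea is to accept a comma-separated override and fall back to the hardcoded default when none is given.

```python
import os
from typing import List, Optional, Sequence

# The single instance currently enabled in the hardcoded list.
DEFAULT_INSTANCE_IDS = ["astropy__astropy-12907"]


def resolve_instance_ids(raw: Optional[str], default: Sequence[str]) -> List[str]:
    """Parse a comma-separated override, falling back to the default list."""
    if raw is None:
        return list(default)
    # Drop surrounding whitespace and empty entries (e.g. a trailing comma).
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    # An empty or whitespace-only override also falls back to the default.
    return ids or list(default)


# Hypothetical wiring: read the override from the environment.
instance_ids = resolve_instance_ids(
    os.environ.get("SWE_BENCH_INSTANCE_IDS"), DEFAULT_INSTANCE_IDS
)
```

The same helper would work equally well behind a task parameter or a CLI flag. The environment variable is only the simplest way to show the fallback behaviour.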
