
Conversation

itsmeknt

When running the Aider benchmark, it is sometimes useful to analyze model performance by programming language. Some users may want to choose a model that does well specifically in Go, even if its overall benchmark score is low.

I added some self-contained code to benchmark.py so that running benchmark.py --stats together with --verbose prints the benchmark stats broken down by language at the bottom of the report. Without --verbose, the behavior is unchanged.
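The aggregation needs no extra metadata because the polyglot benchmark nests every exercise under a top-level language directory (go/exercises/practice/..., etc.) and each finished test leaves a .aider.results.json behind. Here is a minimal sketch of the rollup; the helper names and the exact result-file fields are assumptions for illustration, not necessarily the code in this PR:

```python
# Minimal sketch of a per-language stats rollup. Helper names and the
# .aider.results.json fields read here are illustrative assumptions.
import json
from collections import defaultdict
from pathlib import Path

LANGUAGES = {"python", "go", "rust", "cpp", "javascript", "java"}

def language_of(path: Path) -> str:
    """Return the top-level language directory a result file lives under."""
    for part in path.parts:
        if part in LANGUAGES:
            return part
    return "unknown"

def stats_by_language(run_dir: Path) -> dict:
    stats = defaultdict(lambda: defaultdict(float))
    for results_file in Path(run_dir).rglob(".aider.results.json"):
        res = json.loads(results_file.read_text())
        s = stats[language_of(results_file)]
        s["completed_tests"] += 1
        s["duration"] += res.get("duration", 0)
        s["prompt_tokens"] += res.get("prompt_tokens", 0)
        s["completion_tokens"] += res.get("completion_tokens", 0)
        # tests_outcomes holds one boolean per attempt; pass_num_i counts
        # tests that passed within attempts 0..i (cumulative).
        outcomes = res.get("tests_outcomes", [])
        for attempt in range(len(outcomes)):
            if any(outcomes[: attempt + 1]):
                s[f"pass_num_{attempt}"] += 1
    # Derive pass rates from the cumulative pass counts.
    for s in stats.values():
        n = s["completed_tests"]
        for key in [k for k in s if k.startswith("pass_num_")]:
            s[key.replace("num", "rate")] = round(100 * s[key] / n, 2)
    return stats
```

Keying off the directory name keeps the breakdown robust to new benchmark languages: extending LANGUAGES is the only change needed.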

Here is an example:

./benchmark/benchmark.py --stats --verbose reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium/

──────────────────────────────────────────── reports_from_benchmarks/gpt-oss-20b/medium/whole/2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium ────────────────────────────────────────────
- dirname: 2025-09-12-09-53-14--bench-full-whole-openai-openai-gpt-oss-20b-medium
  test_cases: 225
  model: openai/openai/gpt-oss-20b
  edit_format: whole
  commit_hash: 32faf82-dirty
  reasoning_effort: medium
  pass_rate_1: 9.8
  pass_rate_2: 36.0
  pass_num_1: 22
  pass_num_2: 81
  percent_cases_well_formed: 100.0
  error_outputs: 27
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 154
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2162608
  completion_tokens: 1224921
  test_timeouts: 4
  total_tests: 225
  command: aider --model openai/openai/gpt-oss-20b
  date: 2025-09-12
  versions: 0.86.2.dev
  seconds_per_case: 801.2
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected

======== Stats by language ========

| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
|                              |   python  |     go    |    rust   |    cpp    | javascript |    java   |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |
| completed_tests              |        34 |        39 |        30 |        26 |         49 |        47 |
| duration                     | 24,957.62 | 21,706.71 | 17,028.67 | 51,506.41 |  29,789.68 | 35,275.56 |
| avg_duration_per_test        |    734.05 |    556.58 |    567.62 |  1,981.02 |     607.95 |    750.54 |
| cost                         |         - |         - |         - |         - |          - |         - |
| pass_rate_0                  |      5.88 |      5.13 |      6.67 |      7.69 |       4.08 |      4.26 |
| pass_rate_1                  |     35.29 |     30.77 |     40.00 |     46.15 |      24.49 |     25.53 |
| pass_num_0                   |         2 |         2 |         2 |         2 |          2 |         2 |
| pass_num_1                   |        12 |        12 |        12 |        12 |         12 |        12 |
| error_outputs                |         7 |         2 |         3 |         - |         14 |         1 |
| user_asks                    |         1 |         1 |         - |       139 |          - |        13 |
| test_timeouts                |         - |         - |         1 |         - |          2 |         1 |
| exhausted_context_windows    |         - |         - |         - |         - |          - |         - |
| num_malformed_responses      |         - |         - |         - |         - |          - |         - |
| num_with_malformed_responses |         - |         - |         - |         - |          - |         - |
| syntax_errors                |         - |         - |         - |         - |          - |         - |
| indentation_errors           |         - |         - |         - |         - |          - |         - |
| lazy_comments                |         - |         - |         - |         - |          - |         - |
| prompt_tokens                |   204,931 |   159,565 |   127,949 | 1,078,034 |    247,566 |   344,563 |
| completion_tokens            |   138,725 |   159,982 |   128,591 |   379,616 |    185,134 |   232,873 |
| ---------------------------- | --------- | --------- | --------- | --------- | ---------- | --------- |

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

ei-grad and others added 30 commits April 14, 2025 22:15
Co-authored-by: aider (vertex_ai/gemini-2.5-pro-exp-03-25) <[email protected]>
- Add tool_prompt to CoderPrompts class
- Modify fmt_system_prompt to include tool prompt when MCP tools are available
- This enables better handling of tool-based interactions when using MCP servers
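For context on the MCP commit above, a rough sketch of what the described fmt_system_prompt change could look like; the class shapes, attribute wiring, and {tool_prompt} placeholder handling are assumptions for illustration, not the actual aider source:

```python
# Simplified sketch of the change the commit message describes: CoderPrompts
# gains a tool_prompt, and fmt_system_prompt splices it in only when MCP
# tools are available. Shapes and names here are illustrative assumptions.
class CoderPrompts:
    main_system = "Act as an expert software developer.\n{tool_prompt}"
    tool_prompt = (
        "You have access to tools provided by connected MCP servers. "
        "Use them when they help complete the user's request."
    )

class Coder:
    def __init__(self, prompts, mcp_tools=None):
        self.gpt_prompts = prompts
        self.mcp_tools = mcp_tools or []

    def fmt_system_prompt(self, prompt):
        # With no MCP tools configured, the placeholder collapses to "",
        # preserving the pre-MCP behavior of the system prompt.
        tool_prompt = self.gpt_prompts.tool_prompt if self.mcp_tools else ""
        return prompt.format(tool_prompt=tool_prompt)
```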
@CLAassistant

CLAassistant commented Sep 18, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ itsmeknt
✅ cryptekbits
❌ dwash96
You have signed the CLA already but the status is still pending? Let us recheck it.

