MindTrial

MindTrial helps you assess and compare the performance of AI language models (LLMs) on tasks that use text prompts, with optional file or image attachments. Use it to evaluate a single model or test multiple models from OpenAI, Google, Anthropic, and DeepSeek side by side, and get easy-to-read results in HTML and CSV formats.

Quick Start Guide

  1. Install the tool:

    go install github.com/petmal/mindtrial/cmd/mindtrial@latest
  2. Run with default settings:

    mindtrial run

Prerequisites

  • Go 1.23
  • API keys from your chosen AI providers

Key Features

  • Compare multiple AI models at once
  • Create custom test tasks using simple YAML files
  • Attach files or images to prompts for visual tasks
  • Get results in HTML and CSV formats
  • Easy to extend with new AI models
  • Smart rate limiting to prevent API overload
  • Interactive mode with terminal-based UI

Basic Usage

  1. Display available commands and options:

    mindtrial help
  2. Run with custom configuration and output options:

    mindtrial --config="custom-config.yaml" --tasks="custom-tasks.yaml" --output-dir="./results" --output-basename="custom-tasks-results" run
  3. Run with specific output formats (CSV only, no HTML):

    mindtrial --csv=true --html=false run
  4. Run in interactive mode to select models and tasks before starting:

    mindtrial --interactive run

Configuration Guide

MindTrial uses two simple YAML files to control everything:

1. config.yaml - Application Settings

Controls how MindTrial operates, including:

  • Where to save results
  • Which AI models to use
  • API settings and rate limits

2. tasks.yaml - Task Definitions

Defines what you want to test, including:

  • Questions/prompts for the AI
  • Expected answers
  • Response format rules

Tip

New to MindTrial? Start with the example files provided and modify them for your needs.

Tip

Use interactive mode with the --interactive flag to select model configurations and tasks before running, without having to edit configuration files.

config.yaml

This file defines the tool's settings and target model configurations evaluated during the trial run. The following properties are required:

  • output-dir: Path to the directory where results will be saved.
  • task-source: Path to the file with definitions of tasks to run.
  • providers: List of providers (i.e. target LLM configurations) to execute tasks during the trial run.
    • name: Name of the LLM provider (e.g. openai).
    • client-config: Configuration for this provider's client (e.g. API key).
    • runs: List of runs (i.e. model configurations) for this provider. Unless disabled, all configurations will be trialed.
      • name: Display-friendly name to be shown in the results.
      • model: Name of the model, exactly as defined by the backend service's API (e.g. gpt-4o-mini).
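
For reference, a minimal sketch showing just this required structure (the API key and model name are placeholders to replace with your own):

config:
  output-dir: "./results"
  task-source: "./tasks.yaml"
  providers:
    - name: openai
      client-config:
        api-key: "<your-api-key>"
      runs:
        - name: "example run"
          model: "gpt-4o-mini"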

Important

All provider names must match exactly:

  • openai: OpenAI GPT models
  • google: Google Gemini models
  • anthropic: Anthropic Claude models
  • deepseek: DeepSeek open-source models

Note

Anthropic and DeepSeek providers support configurable request timeout in the client-config section:

  • request-timeout: Sets the timeout duration for API requests, including any time the model spends thinking.
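
For example, to allow each request up to ten minutes (a sketch; pick a duration that suits your models):

client-config:
  api-key: "<your-api-key>"
  request-timeout: 10m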

Note

Some models support additional model-specific runtime configuration parameters. These can be provided in the model-parameters section of the run configuration.

Currently supported parameters for OpenAI models include:

  • text-response-format: If true, use plain-text response format (less reliable) for compatibility with models that do not support JSON.
  • reasoning-effort: Controls reasoning effort for reasoning models (low, medium, or high).
  • temperature: Controls randomness/creativity of responses (range: 0.0 to 2.0, default: 1.0). Lower values produce more focused and deterministic outputs.
  • top-p: Controls diversity via nucleus sampling (range: 0.0 to 1.0, default: 1.0). Lower values produce more focused outputs.
  • presence-penalty: Penalizes new tokens based on their presence in the text so far (range: -2.0 to 2.0, default: 0.0). Positive values encourage the model to use new tokens.
  • frequency-penalty: Penalizes new tokens based on their frequency in the text so far (range: -2.0 to 2.0, default: 0.0). Positive values encourage the model to use less frequent tokens.
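
As a sketch, a hypothetical OpenAI run using several of these parameters (the values are arbitrary illustrations, not recommendations):

runs:
  - name: "4o-mini - focused (example)"
    model: "gpt-4o-mini"
    model-parameters:
      temperature: 0.2
      top-p: 0.9
      presence-penalty: 0.5
      frequency-penalty: 0.5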

Currently supported parameters for Anthropic models include:

  • max-tokens: Controls the maximum number of tokens available to the model for generating a response.
  • thinking-budget-tokens: Enables enhanced reasoning capabilities when set. Specifies the number of tokens the model can use for its internal reasoning process. Must be at least 1024 and less than max-tokens.
  • temperature: Controls randomness/creativity of responses (range: 0.0 to 1.0, default: 1.0). Lower values produce more focused and deterministic outputs.
  • top-p: Controls diversity via nucleus sampling (range: 0.0 to 1.0). Lower values produce more focused outputs.
  • top-k: Limits tokens considered for each position to top K options. Higher values allow more diverse outputs.

Currently supported parameters for Google models include:

  • text-response-format: If true, use plain-text response format (less reliable) for compatibility with models that do not support JSON.
  • temperature: Controls randomness/creativity of responses (range: 0.0 to 2.0, default: 1.0). Lower values produce more focused and deterministic outputs.
  • top-p: Controls diversity via nucleus sampling (range: 0.0 to 1.0). Lower values produce more focused outputs.
  • top-k: Limits tokens considered for each position to top K options. Higher values allow more diverse outputs.
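
For Google, a hypothetical run might look like this (the model name and values are examples):

- name: google
  client-config:
    api-key: "<your-api-key>"
  runs:
    - name: "Gemini - plain text (example)"
      model: "gemini-2.0-flash"
      model-parameters:
        text-response-format: true
        temperature: 0.4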

Currently supported parameters for DeepSeek models include:

  • temperature: Controls randomness/creativity of responses (range: 0.0 to 2.0, default: 1.0). Lower values produce more focused and deterministic outputs.
  • top-p: Controls diversity via nucleus sampling (range: 0.0 to 1.0). Lower values produce more focused outputs.
  • presence-penalty: Penalizes new tokens based on their presence in the text so far (range: -2.0 to 2.0, default: 0.0). Positive values encourage the model to use new tokens.
  • frequency-penalty: Penalizes new tokens based on their frequency in the text so far (range: -2.0 to 2.0, default: 0.0). Positive values encourage the model to use less frequent tokens.
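
And a DeepSeek sketch along the same lines (values are examples only):

runs:
  - name: "DeepSeek - low temperature (example)"
    model: "deepseek-chat"
    model-parameters:
      temperature: 0.3
      top-p: 0.9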

Note

The results will be saved to <output-dir>/<output-basename>.<format>. If the result output file already exists, it will be replaced. If the log file already exists, it will be appended to.

Tip

The following placeholders are available for output paths and names:

  • {{.Year}}: Current year
  • {{.Month}}: Current month
  • {{.Day}}: Current day
  • {{.Hour}}: Current hour
  • {{.Minute}}: Current minute
  • {{.Second}}: Current second
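
For example, the settings below would produce a date-stamped results directory and a time-stamped base name (the expanded values in the comments are illustrative):

output-dir: "./results/{{.Year}}-{{.Month}}-{{.Day}}/"   # e.g. ./results/2025-06-01/
output-basename: "{{.Hour}}-{{.Minute}}-{{.Second}}"     # e.g. 09-30-00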

Tip

If log-file and/or output-basename is blank, the log and/or output will be written to stdout.

Note

MindTrial processes tasks across different AI providers simultaneously (in parallel). However, when running multiple configurations from the same provider (e.g. different OpenAI models), these are processed one after another (sequentially).

Tip

Models can use the max-requests-per-minute property in their run configurations to limit the number of requests made per minute.

Tip

To disable all run configurations for a given provider, set disabled: true on that provider. An individual run configuration can override this by setting disabled: false (e.g. to enable just that one configuration).

Example snippet from config.yaml:

# config.yaml
config:
  log-file: ""
  output-dir: "./results/{{.Year}}-{{.Month}}-{{.Day}}/"
  output-basename: "{{.Hour}}-{{.Minute}}-{{.Second}}"
  task-source: "./tasks.yaml"
  providers:
    - name: openai
      disabled: true
      client-config:
        api-key: "<your-api-key>"
      runs:
        - name: "4o-mini - latest"
          disabled: false
          model: "gpt-4o-mini"
          max-requests-per-minute: 3
        - name: "o1-mini - latest"
          model: "o1-mini"
          max-requests-per-minute: 3
          model-parameters:
            text-response-format: true
        - name: "o3-mini - latest (high reasoning)"
          model: "o3-mini"
          max-requests-per-minute: 3
          model-parameters:
            reasoning-effort: "high"
    - name: anthropic
      client-config:
        api-key: "<your-api-key>"
      runs:
        - name: "Claude 3.7 Sonnet - latest"
          model: "claude-3-7-sonnet-latest"
          max-requests-per-minute: 5
          model-parameters:
            max-tokens: 4096
        - name: "Claude 3.7 Sonnet - latest (extended thinking)"
          model: "claude-3-7-sonnet-latest"
          max-requests-per-minute: 5
          model-parameters:
            max-tokens: 8192
            thinking-budget-tokens: 2048
    - name: deepseek
      client-config:
        api-key: "<your-api-key>"
        request-timeout: 10m
      runs:
        - name: "DeepSeek-R1 - latest"
          model: "deepseek-reasoner"
          max-requests-per-minute: 15

tasks.yaml

This file defines the tasks to be executed on all enabled run configurations. Each task must define the following four properties:

  • name: Display-friendly name to be shown in the results.
  • prompt: The prompt (i.e. task) that will be sent to the AI model.
  • response-result-format: Defines how the AI should format its final answer to the prompt. This matters because the final answer is compared against the expected-result, so it must consistently follow the same format.
  • expected-result: The expected (i.e. valid) final answer to the prompt. It must follow the response-result-format precisely.

Optionally, a task can include a list of files to be sent along with the prompt:

  • files: A list of files to attach to the prompt. Each file entry defines the following properties:
    • name: A unique name for the file, used for reference within the prompt if needed.
    • uri: The path or URI to the file. Local file paths and remote HTTP/HTTPS URLs are supported. The file content will be downloaded and sent with the request.
    • type: The MIME type of the file (e.g., image/png, image/jpeg). If omitted, the tool will attempt to infer the type based on the file extension or content.
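
For instance, a task could attach a remote image like this (the URL is a placeholder; any reachable HTTP/HTTPS resource of a supported type should work):

files:
  - name: "diagram"
    uri: "https://example.com/diagram.png"
    type: "image/png"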

Note

If a task includes files, it will be skipped for any provider configuration that does not support file uploads or does not support the specific file type.

Note

Currently supported image types include: image/jpeg, image/jpg, image/png, image/gif, image/webp. Support may vary by provider.

Note

Currently, letter case is ignored when comparing the final answer to the expected-result.

Tip

To disable all tasks by default, set disabled: true in the task-config section. An individual task can override this by setting disabled: false (e.g. to enable just that one task).

A sample task from tasks.yaml:

# tasks.yaml
task-config:
  disabled: true
  tasks:
    - name: "riddle - split words - v1"
      disabled: false
      prompt: |-
        There are four 8-letter words (animals) that have been split into 2-letter pieces.
        Find these four words by putting appropriate pieces back together:

        RR TE KA DG EH AN SQ EL UI OO HE LO AR PE NG OG
      response-result-format: |-
        list of words in alphabetical order separated by ", "
      expected-result: |-
        ANTELOPE, HEDGEHOG, KANGAROO, SQUIRREL
    - name: "visual - shapes - v1"
      prompt: |-
        The attached picture contains various shapes marked by letters.
        It also contains a set of same shapes that have been rotated marked by numbers.
        Your task is to find all matching pairs.
      response-result-format: |-
        <shape number>: <shape letter> pairs separated by ", " and ordered by shape number
      expected-result: |-
        1: G, 2: F, 3: B, 4: A, 5: C, 6: D, 7: E
      files:
        - name: "picture"
          uri: "./taskdata/visual-shapes-v1.png"
          type: "image/png"

Command Reference

mindtrial [options] [command]

Commands:
  run                       Start the trials
  help                      Show help
  version                   Show version

Options:
  --config string           Configuration file path (default: config.yaml)
  --tasks string            Task definitions file path
  --output-dir string       Results output directory
  --output-basename string  Base filename for results; replace if exists; blank = stdout
  --html                    Generate HTML output (default: true)
  --csv                     Generate CSV output (default: false)
  --log string              Log file path; append if exists; blank = stdout
  --verbose                 Enable detailed logging
  --debug                   Enable low-level debug logging (implies --verbose)
  --interactive             Enable interactive interface for run configuration and real-time progress monitoring (default: false)

Contributing

Contributions are welcome! Please review our CONTRIBUTING.md guidelines for more details.

Getting the Source Code

Clone the repository and install dependencies:

git clone https://github.com/petmal/mindtrial.git
cd mindtrial
go mod download

Running Tests

Execute the unit tests with:

go test -tags=test -race -v ./...

Project Details

/
├── cmd/
│   └── mindtrial/       # Command-line interface and main entry point
│       └── tui/         # Terminal-based UI and interactive mode functionality
├── config/              # Data models and management for configuration and task definitions
├── formatters/          # Output formatting for results
├── pkg/                 # Shared packages and utilities
├── providers/           # AI model service provider connectors
├── runners/             # Task execution and result aggregation
└── version/             # Application metadata

License

This project is licensed under the Mozilla Public License 2.0 - see the LICENSE file for details.
