[Feature Request] Support Multimodal Inputs (Images, Audio, Video) in ACI Playground #491

### Required Pre-requisites

- [x] I have read the [Documentation](https://www.aci.dev/docs)
- [x] I have searched the [Issue Tracker](https://github.com/aipotheosis-labs/aci/issues) and [Discussions](https://github.com/aipotheosis-labs/aci/discussions) and confirmed that this hasn't been reported yet.
- [x] Consider asking in [Discussions](https://github.com/aipotheosis-labs/aci/discussions) first

### Motivation

Description:
Currently, the Agent Playground in ACI.dev supports only text-based interactions, which limits testing of agents built on multimodal models (such as GPT-4o or Gemini).

Feature Request:
Implement support for multimodal inputs (images, audio, video) within the Agent Playground interface, enabling comprehensive testing and demonstration of multimodal capabilities.

Use Cases:

- Quickly test image recognition and analysis.

- Validate how agents respond to audio or video content.

- Demonstrate multimodal agent workflows to stakeholders.

Suggested Implementation:

- Allow users to upload files directly in the Playground.

- Send files either as Base64-encoded content parts (for example `image_content` or `audio_content`) or by URL.

- Clearly document supported file formats and size limitations.
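A minimal sketch of the encoding step, to make the two options concrete. The `image_content` and `audio_content` type names come from this issue; `video_content`, the `url` field, and the helper name `to_content_part` are assumptions for illustration, not part of any existing ACI API.

```python
import base64
import mimetypes
from typing import Optional

# Assumed mapping from top-level MIME type to content-part type.
# "image_content" and "audio_content" appear in this issue; "video_content" is a guess.
PART_TYPES = {"image": "image_content", "audio": "audio_content", "video": "video_content"}


def to_content_part(filename: str, data: Optional[bytes] = None, url: Optional[str] = None) -> dict:
    """Build one content part: inline Base64 when bytes are given, otherwise a URL reference."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type of {filename!r}")
    family = mime_type.split("/", 1)[0]
    if family not in PART_TYPES:
        raise ValueError(f"unsupported media type: {mime_type}")

    part = {"type": PART_TYPES[family], "mimeType": mime_type}
    if data is not None:
        # Base64 output is plain ASCII, so it can be embedded in a JSON string.
        part["data"] = base64.b64encode(data).decode("ascii")
    elif url is not None:
        part["url"] = url  # hypothetical field name for the URL variant
    else:
        raise ValueError("either data or url is required")
    return part
```

For example, `to_content_part("photo.png", data=b"abc")` yields a part with `"type": "image_content"`, `"mimeType": "image/png"`, and `"data": "YWJj"`.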

Benefits:

- Improved productivity for developers building multimodal agents.

- Enhanced capabilities and appeal of ACI.dev Playground.

Additional context:
Multimodal models such as GPT-4o and Gemini 2.5 are now widely used, so native support in the Playground would streamline workflows significantly.

### Proposed Solution

Suggested Implementation:

Add an intuitive file upload feature (drag-and-drop or upload button) in the playground interface.

Automatically encode uploaded media files into Base64 so they can be embedded directly in the JSON request payload. Note that Base64 increases the transmitted size by roughly one third.

Structure payloads clearly, for example:
```json
{
  "content": [
    { "type": "text", "text": "Describe this image" },
    { "type": "image_content", "mimeType": "image/png", "data": "<Base64_encoded_data>" }
  ]
}
```
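On the receiving side, the backend would need to turn such a payload back into raw bytes. The following is a rough sketch of that step, assuming the payload shape shown above; the function name `decode_media_parts` is hypothetical.

```python
import base64
import json
from typing import List, Tuple


def decode_media_parts(raw_json: str) -> List[Tuple[str, str, bytes]]:
    """Return (part type, MIME type, decoded bytes) for every non-text part."""
    payload = json.loads(raw_json)
    decoded = []
    for part in payload["content"]:
        if part["type"] == "text":
            continue  # text parts carry no binary data
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently discarding them.
        raw = base64.b64decode(part["data"], validate=True)
        decoded.append((part["type"], part["mimeType"], raw))
    return decoded
```

Because Base64 emits four characters for every three input bytes, any size limit is probably best checked against the decoded bytes rather than the length of the `data` string.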

- Enhance backend handling to support multimodal data by routing these requests to compatible multimodal models such as GPT-4o or Gemini.

- Establish clear validation rules, including file format support and maximum allowed file sizes (e.g., 5 MB for images, 10 MB for audio, and 50 MB for video).
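The two backend steps above (validation, then routing) could be sketched roughly as follows. The size limits are the ones suggested in this issue; the allow-list of formats, the interpretation of MB as 1024 * 1024 bytes, the model names, and the capability table are all placeholder assumptions, not a description of how ACI or any provider actually works.

```python
from typing import Dict, List, Set

MB = 1024 * 1024  # assumption: binary megabytes

# Size limits suggested in this issue, keyed by top-level MIME type.
MAX_BYTES: Dict[str, int] = {"image": 5 * MB, "audio": 10 * MB, "video": 50 * MB}

# Assumed allow-list; the real set of supported formats is still to be decided.
ALLOWED_MIME_TYPES: Set[str] = {
    "image/png", "image/jpeg", "audio/mpeg", "audio/wav", "video/mp4",
}

# Hypothetical capability table; the model names are placeholders.
MODEL_MODALITIES: Dict[str, Set[str]] = {
    "text-only-model": {"text"},
    "vision-model": {"text", "image"},
    "omni-model": {"text", "image", "audio", "video"},
}


def validate_upload(mime_type: str, size_bytes: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file is accepted."""
    errors = []
    if mime_type not in ALLOWED_MIME_TYPES:
        errors.append(f"unsupported format: {mime_type}")
    limit = MAX_BYTES.get(mime_type.split("/", 1)[0])
    if limit is not None and size_bytes > limit:
        errors.append(f"file is {size_bytes} bytes, limit is {limit} bytes")
    return errors


def pick_model(payload: dict, preferred: str) -> str:
    """Keep the preferred model if it covers every modality in the payload, else fall back."""
    # "image_content" -> "image", "text" -> "text"
    needed = {part["type"].split("_", 1)[0] for part in payload["content"]}
    if needed <= MODEL_MODALITIES.get(preferred, set()):
        return preferred
    for name, supported in MODEL_MODALITIES.items():
        if needed <= supported:
            return name
    raise ValueError(f"no configured model supports {sorted(needed)}")
```

Returning a list of errors instead of raising on the first one lets the Playground show every problem with an upload at once.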

Benefits:

- Streamlines testing and demonstration workflows.

- Ensures compatibility with modern multimodal models.

- Improves developer productivity and platform attractiveness.

Additional context:

Many multimodal models are becoming standard (GPT-4o, Gemini 2.5), and enabling native support in the Playground would significantly streamline agent testing workflows.
