[Feature Request] Support Multimodal Inputs (Images, Audio, Video) in ACI Playground #491

@ivan-perez-sl

Description

Required Pre-requisites

Motivation

Description:
Currently, the Agent Playground in ACI.dev supports only text-based interactions, which limits testing of multimodal models such as GPT-4o or Gemini.

Feature Request:
Implement support for multimodal inputs (images, audio, video) within the Agent Playground interface, enabling comprehensive testing and demonstration of multimodal capabilities.

Use Cases:

  • Quickly test image recognition and analysis.

  • Validate agents responding to audio or video content.

  • Demonstrate multimodal agent workflows to stakeholders.

Suggested Implementation:

  • Allow users to upload files directly in the Playground.

  • Send files either as Base64-encoded content parts (image_content/audio_content) or as URL references.

  • Clearly document supported file formats and size limitations.
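The upload-and-encode step above could look roughly like the following sketch. It is illustrative only: `build_media_part` and `build_message` are hypothetical helper names, the `mimeType` and `data` fields follow the payload example in the Proposed Solution section of this issue, and `video_content` is an assumed type name by analogy with `image_content`/`audio_content`.

```python
import base64
import mimetypes

# image_content and audio_content come from this proposal;
# video_content is an assumed name, not an existing ACI.dev type.
PART_TYPES = {
    "image": "image_content",
    "audio": "audio_content",
    "video": "video_content",
}


def build_media_part(filename: str, data: bytes) -> dict:
    """Turn an uploaded file into a Base64 content part (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type of {filename!r}")
    family = mime_type.split("/", 1)[0]  # "image", "audio", "video", ...
    if family not in PART_TYPES:
        raise ValueError(f"unsupported media type: {mime_type}")
    return {
        "type": PART_TYPES[family],
        "mimeType": mime_type,
        "data": base64.b64encode(data).decode("ascii"),
    }


def build_message(prompt: str, filename: str, data: bytes) -> dict:
    """Combine a text prompt and one media file into a single message body."""
    return {
        "content": [
            {"type": "text", "text": prompt},
            build_media_part(filename, data),
        ]
    }
```

For example, `build_message("Describe this image", "photo.png", raw_bytes)` would yield a text part followed by an `image_content` part. A URL-based variant would need its own field name, which this issue does not specify.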

Benefits:

  • Improved productivity for developers building multimodal agents.

  • Enhanced capabilities and appeal of the ACI.dev Playground.

Additional context:
Many multimodal models are now standard (GPT-4o, Gemini 2.5), and enabling native support in the Playground would streamline workflows significantly.

Proposed Solution

Suggested Implementation:

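  • Add an intuitive file upload feature (drag-and-drop or upload button) to the Playground interface.

  • Automatically encode uploaded media files as Base64 so they can be embedded directly in the JSON request body. Base64 adds roughly 33% to the payload size, which is one reason to enforce file size limits.

  • Structure payloads clearly, for example: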


{
  "content": [
    { "type": "text", "text": "Describe this image" },
    { "type": "image_content", "mimeType": "image/png", "data": "<Base64_encoded_data>" }
  ]
}

  • Enhance backend handling to support multimodal data by routing these requests to compatible multimodal models such as GPT-4o or Gemini.

  • Establish clear validation rules, including file format support and maximum allowed file sizes (e.g., 5 MB for images, 10 MB for audio, and 50 MB for video).
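The validation rules above might be sketched as follows. This is a sketch under assumptions, not an existing ACI.dev API: `validate_upload` and `base64_size` are hypothetical names, the limits are the example values from this issue, and "MB" is taken to mean 1024 × 1024 bytes.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Example limits proposed in this issue, keyed by top-level media type.
MAX_BYTES = {
    "image": 5 * MB,
    "audio": 10 * MB,
    "video": 50 * MB,
}


def validate_upload(mime_type: str, size_bytes: int) -> None:
    """Raise ValueError if the file type is unsupported or the file is too large."""
    family = mime_type.split("/", 1)[0]
    limit = MAX_BYTES.get(family)
    if limit is None:
        raise ValueError(f"unsupported media type: {mime_type}")
    if size_bytes > limit:
        raise ValueError(
            f"{family} file is {size_bytes} bytes; limit is {limit} bytes"
        )


def base64_size(size_bytes: int) -> int:
    """Length of the padded Base64 encoding of size_bytes raw bytes."""
    return 4 * ((size_bytes + 2) // 3)
```

Since Base64 emits 4 characters for every 3 input bytes, a file at the 5 MB image limit becomes roughly 6.7 MB of text in the request body, so any backend request-size limit would need to account for that overhead.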

Benefits:

  • Streamlines testing and demonstration workflows.

  • Ensures compatibility with modern multimodal models.

  • Improves developer productivity and platform attractiveness.

Additional context:

Many multimodal models are becoming standard (GPT-4o, Gemini 2.5), and enabling native support in the Playground would significantly streamline agent testing workflows.
