Description
Required Pre-requisites
- I have read the Documentation
- I have searched the Issue Tracker and Discussions to confirm that this hasn't been reported yet.
- Consider asking in Discussions first
Motivation
Description:
Currently, the Agent Playground in ACI.dev supports only text-based interactions, which makes it hard to test agents built on multimodal models (such as GPT-4o or Gemini).
Feature Request:
Implement support for multimodal inputs (images, audio, video) within the Agent Playground interface, enabling comprehensive testing and demonstration of multimodal capabilities.
Use Cases:
- Quickly test image recognition and analysis.
- Validate agents responding to audio or video content.
- Demonstrate multimodal agent workflows to stakeholders.
Suggested Implementation:
- Allow users to upload files directly in the Playground.
- Encode files as Base64 (image_content/audio_content) or URLs.
- Clearly document supported file formats and size limitations.
Benefits:
- Improved productivity for developers building multimodal agents.
- Enhanced capabilities and appeal of ACI.dev Playground.
Additional context:
Many multimodal models are now standard (GPT-4o, Gemini 2.5), and enabling native support in the Playground would streamline workflows significantly.
Proposed Solution
Suggested Implementation:
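- Add a file upload control (drag-and-drop or an upload button) to the Playground interface.
- Automatically encode uploaded media files as Base64 so they can be embedded directly in a JSON request body. Note that Base64 increases the payload size by roughly one third, so URLs may be preferable for large files.
- Structure payloads clearly, for example: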
{
  "content": [
    { "type": "text", "text": "Describe this image" },
    { "type": "image_content", "mimeType": "image/png", "data": "<Base64_encoded_data>" }
  ]
}
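As a rough illustration of the encoding step, here is a minimal Python sketch that turns uploaded bytes into a payload of the shape shown above. The function name `build_payload` and the `PART_TYPES` mapping are hypothetical and not part of ACI.dev; the `image_content` and `audio_content` type names come from this issue, and video is left out because no type name is proposed for it here.

```python
import base64
import json

# Content-part type names taken from this issue's text. No name is proposed
# for video, so this sketch (an assumption, not an ACI.dev API) omits it.
PART_TYPES = {"image": "image_content", "audio": "audio_content"}


def build_payload(prompt: str, media: bytes, mime_type: str) -> dict:
    """Return a payload with one text part and one Base64-encoded media part."""
    family = mime_type.split("/", 1)[0]  # "image/png" -> "image"
    if family not in PART_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    encoded = base64.b64encode(media).decode("ascii")
    return {
        "content": [
            {"type": "text", "text": prompt},
            {"type": PART_TYPES[family], "mimeType": mime_type, "data": encoded},
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real uploaded file.
    payload = build_payload("Describe this image", b"\x89PNG\r\n\x1a\n", "image/png")
    print(json.dumps(payload, indent=2))
```

The frontend could perform the same encoding in the browser; doing it server-side instead would keep the size check and the encoding in one place.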
- Enhance backend handling to support multimodal data by routing these requests to compatible multimodal models such as GPT-4o or Gemini multimodal.
- Establish clear validation rules, including file format support and maximum allowed file sizes (e.g., 5 MB for images, 10 MB for audio, and 50 MB for video).
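The validation rule could look something like the following Python sketch. The size limits are the example values suggested above; the allow-list of formats, the helper name `validate_upload`, and the choice of 1 MB = 1024 * 1024 bytes are assumptions made purely for illustration, since this issue does not specify them.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Example limits suggested in this issue, keyed by media family.
MAX_BYTES = {"image": 5 * MB, "audio": 10 * MB, "video": 50 * MB}

# Hypothetical allow-list; the issue does not say which formats to support.
ALLOWED_MIME_TYPES = {
    "image/png",
    "image/jpeg",
    "audio/mpeg",
    "audio/wav",
    "video/mp4",
}


def validate_upload(mime_type: str, size_bytes: int) -> None:
    """Raise ValueError if the upload's format or size is not allowed."""
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"unsupported format: {mime_type}")
    limit = MAX_BYTES[mime_type.split("/", 1)[0]]
    if size_bytes > limit:
        raise ValueError(
            f"{mime_type} upload is {size_bytes} bytes; limit is {limit} bytes"
        )


if __name__ == "__main__":
    validate_upload("image/png", 2 * MB)  # passes silently
    try:
        validate_upload("image/png", 6 * MB)
    except ValueError as err:
        print(err)
```

Running the check before Base64 encoding avoids spending time encoding a file that would be rejected anyway.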
Benefits:
- Streamlines testing and demonstration workflows.
- Ensures compatibility with modern multimodal models.
- Improves developer productivity and platform attractiveness.
Additional context:
Many multimodal models are becoming standard (GPT-4o, Gemini 2.5), and enabling native support in the Playground would significantly streamline agent testing workflows.