Add multi-file attachment support for tests, traces, and playground #1441

Merged
harry-rhesis merged 48 commits into main from feat/file-testing
Mar 4, 2026

@harry-rhesis
Contributor

Purpose

Add comprehensive file attachment capabilities across the platform, enabling users to attach files (images, PDFs, audio, Excel, JSON) to tests as inputs, view file outputs in test results and traces, and upload files in the playground chat.

What Changed

Backend

  • File model & migration: New File table with polymorphic entity_id/entity_type pattern for associating files with Tests, TestResults, and other entities
  • File REST API: Full CRUD router for files with upload (multipart), download (streaming), and entity-scoped listing endpoints
  • Cascade config: File cleanup on parent entity deletion
  • Mixin enhancements: FileAttachmentMixin for models that support file attachments; improved lazy-load failure handling in relationship properties
  • Trace file linking: Link files from endpoint invocations to trace spans for file format filtering
  • Evaluation pipeline: Wire metadata (including file info) through to LLM prompts; serialize dict/list outputs as JSON
  • Test set service: Use CRUD layer for attribute updates; include test_set_type_id on creation
  • WebSocket: File attachment support in chat handler

Frontend

  • MultiFileUpload component: Drag-and-drop and click-to-upload with preview, validation, and removal
  • FileAttachmentList component: Display file chips with download support
  • useFiles hook: React hook for file CRUD operations against the files API client
  • Files API client: New client in api-client/ for file upload, download, list, and delete
  • Test creation: File upload/removal UI in manual test writer and test detail views
  • Playground chat: Attach files to messages with inline previews in MessageBubble
  • Test run detail: File sections and trace drawer showing per-turn file attachments
  • Metrics: Required field validation on creation form; icon and focus fixes
  • Constants: test-types.ts and score-types.ts for eliminating magic strings
  • Removed [DEBUG] prefix from API error logs

SDK

  • File entity: New SDK entity for file CRUD with upload/download helpers
  • Test entity: File attachment methods (attach_file, get_files, remove_file)
  • TestResult entity: File attachment support
  • API client: Multipart upload and streaming download support
  • Serializer: Handle dict/list output serialization
  • Native metrics: Pass file metadata through evaluation prompt template

Penelope (Multi-turn agent)

  • File attachment support in agent context, executor, and target interaction tools
  • Endpoint target passes files through multi-turn conversations

Chatbot

  • /chat endpoint accepts file uploads via multipart form data
  • Output mode parameter for JSON responses

Docs

  • Added endpoint mapping examples and single-turn endpoint documentation

Tests

  • Backend: File route tests, cascade deletion tests, file execution tests, integration tests
  • SDK: File entity unit tests, integration tests, serializer tests
  • Penelope: File attachment tests
  • Frontend: FileAttachmentList, MultiFileUpload, useFiles hook, files client tests

Testing

  • Run backend tests: cd apps/backend && uv run pytest ../../tests/backend/routes/test_file.py ../../tests/backend/services/test_file_cascade.py ../../tests/backend/services/test_file_execution.py ../../tests/backend/services/test_file_integration.py
  • Run SDK tests: cd sdk && uv run pytest ../tests/sdk/entities/test_file.py ../tests/sdk/integration/test_file.py
  • Run frontend tests: cd apps/frontend && npm test -- --testPathPattern="(FileAttachmentList|MultiFileUpload|useFiles|files-client)"
  • Manual: Upload files in playground chat, attach files to tests via manual writer, verify files appear in test run detail/trace views

Add polymorphic File model (BYTEA storage) for binary file attachments
on Tests (inputs) and TestResults (outputs). Files flow through the
execution pipeline as base64-encoded JSON and are stored/retrieved via
dedicated API endpoints.

- File model with deferred content, polymorphic entity pattern
- Upload/download/delete endpoints with size and MIME validation
- Nested routes: GET /tests/{id}/files, GET /test-results/{id}/files
- Execution pipeline: inject input files, capture output files
- Soft-delete cascade with entity_type isolation
- SDK serializer: bytes-to-base64 dump strategy
- Alembic migration for file table
- 75 tests (route, cascade, execution, integration, SDK)

The ManualTestWriter was sending only `name` when creating a test set,
but the backend requires `test_set_type_id`. Now resolves the correct
TestSetType lookup based on the user's single/multi-turn selection.
Also fixes TestSetDrawer querying wrong type_name ('TestType' instead
of 'TestSetType') and replaces magic strings with constants.

Add FilesClient API client, useFiles hook, MultiFileUpload and
FileAttachmentList components to support attaching images, PDFs, and
audio files to tests. Integrate file upload into the create test drawer,
test detail page, and manual test writer with per-row attachments.
Includes 52 new tests covering the client, hook, and components.

Replace all 'Single-Turn', 'Multi-Turn', 'TestType', and 'TestSetType'
magic strings across 19 files with TEST_TYPES and TYPE_NAMES constants
from constants/test-types.ts. Also update TypeScript type annotations
to use TestTypeValue and MetricScope types instead of inline unions.

- Remove empty prompt_id that caused UUID validation error
- Fix test set association for individual test creation path
  by calling associateTestsWithTestSet after creating tests
- Remove unused test_set_id from buildTestPayload since the
  single create endpoint doesn't support it
- Redirect to test set detail page after saving when a test
  set name is provided

Multi-turn test files attach to spans during execution, not to the
test entity itself. Hide the Files column and attachment button when
creating multi-turn tests in the manual test writer.

Expose the database row UUID on SpanNode so the frontend can fetch files
attached to individual trace spans. Add a read-only Attachments card in
the Span Details panel that shows files when they exist.

…ocations

Add Jinja2 filters (to_anthropic, to_openai, to_gemini) that transform
input files into provider-specific content formats in request mappings.
Switch TemplateRenderer to use a Jinja2 Environment with registered
filters and auto-parse JSON filter output.

Store input files as File records linked to Trace entities when
endpoints are invoked with file attachments. REST/WebSocket invokers
create files synchronously after span storage. SDK invokers use
deferred linking: files are parked in the Redis/memory cache at
invocation time and created when SDK spans arrive at telemetry ingest.

Update endpoint and test documentation to describe file support,
file format filters, and provider-specific mapping examples.

Accept base64-encoded files in the JSON request body and extract
their text content using the SDK's DocumentExtractor (MarkItDown).
File contents are injected into the LLM prompt between the system
prompt and conversation history. A field_validator coerces empty
strings to None for backward compatibility with callers that send
files: "". Updated all use case prompts to permit file operations.

…ield name

Increase WebSocket max message size from 64KB to 10MB to accommodate
base64-encoded file attachments. Pass files from WebSocket chat payload
to endpoint input_data and return output_files in the response.
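As a rough check on why the limit had to grow: base64 turns every 3 raw bytes into 4 characters, so the encoded payload is about a third larger than the file. The sketch below works through that arithmetic. The helper name is hypothetical (not from this PR), the limits are assumed to be binary units, and JSON framing overhead is ignored.

```python
import base64
import math


def base64_encoded_size(raw_bytes: int) -> int:
    """Length in characters of padded base64 output for raw_bytes of input."""
    return 4 * math.ceil(raw_bytes / 3)


OLD_LIMIT = 64 * 1024         # 64KB, assuming binary units
NEW_LIMIT = 10 * 1024 * 1024  # 10MB, assuming binary units

# Largest raw file whose base64 form still fits each limit
# (the surrounding JSON message is not counted here).
max_raw_old = (OLD_LIMIT // 4) * 3
max_raw_new = (NEW_LIMIT // 4) * 3

# Cross-check the formula against the standard library encoder.
sample = b"\x00" * 1000
assert len(base64.b64encode(sample)) == base64_encoded_size(1000)
```

Under these assumptions the old limit capped attachments at 48KB of raw data, while the new one allows roughly 7.5MB.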

Rename file data field from content_base64 to data for consistency
across input file loading and output file storage.

Add file upload button and drag-and-drop support to PlaygroundChat.
Render file attachment previews (images, PDFs) in MessageBubble.
Extend WebSocket types to include file metadata for chat messages.

Make FileAttachmentList items clickable to trigger authenticated file
downloads via the /files/{id}/content endpoint. Each row now has a
ListItemButton for click-to-download and an explicit download icon
button in the secondary action area.

In MessageBubble, make user-attached file chips and output image
previews clickable for download. Replace AttachFileIcon with
DownloadIcon on user file chips to signal downloadability. Replace
hardcoded image maxHeight with theme spacing.

Move the file attachment button from a standalone button beside the
input to a startAdornment inside the TextField, aligning with common
chat UI patterns. Also size the reset button to match the send button.

Enable Penelope to include test-attached files (images, PDFs, audio)
when sending messages to target endpoints. The LLM decides per-message
whether to include files via include_files parameter, controlled by
test instructions.

Data flow: backend loads files → TestContext.files → system prompt
informs agent → LLM sets include_files=True → executor injects files
→ TargetInteractionTool → Target.send_message(files=...) → endpoint.

Add File entity with upload (from paths and base64), download, and
delete support. Extend Test with add_files(), get_files(), delete_file()
and inline files via push(). Add get_files() to TestResult. Includes
unit and integration tests.

Thread mapped metadata from endpoint responses through to metric
evaluation, allowing evaluation criteria to reference response
metadata in their prompts.

Introduce a "mode" parameter ("text" or "json") across the chatbot
response chain. When mode is "json", uses Pydantic schema-based
generation via the SDK model provider to return structured output.

Ensure dict/list outputs from JSON mode are serialized with
json.dumps() instead of str() across the invocation pipeline,
tracing, response extraction, and conversation storage.
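The difference matters because str() on a dict yields a Python repr (single quotes, True/None), which downstream JSON parsers reject. A minimal sketch of the idea follows; the function name serialize_output is an assumption for illustration, not the helper used in the PR.

```python
import json
from typing import Any


def serialize_output(output: Any) -> str:
    """Return a string form of an endpoint output, keeping dicts/lists valid JSON."""
    if isinstance(output, (dict, list)):
        return json.dumps(output)
    return str(output)


structured = {"answer": "yes", "scores": [1, 2]}

as_repr = str(structured)               # Python repr: single quotes, not valid JSON
as_json = serialize_output(structured)  # valid JSON that round-trips

assert json.loads(as_json) == structured
```

Plain strings pass through unchanged, so text-mode responses are unaffected.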
Extend the file input accept attribute to include JSON, Excel
(.xlsx, .xls), and CSV files alongside existing image and PDF support.

- Add metadata and context as collapsible sections in overview tab
- Auto-detect and pretty-print JSON content in all text fields
- Add fontFamilyCode theme token for monospace rendering
- Show test result files via FileAttachmentList component
- Add "Go to Test" button linking to test detail page
- Show N/Total progress in Tests Executed card
- Add refresh button while test run is in progress

Edit page: replace per-keystroke state updates with ref-based dirty
tracking and blur-triggered re-renders; cache stable ref callbacks
for dynamic evaluation step TextFields to prevent React remounting.

New metric page: use stable index-based keys instead of content-derived
keys that changed on every keystroke causing React to remount elements.

Wire conversation tab responses to open trace drawer on click,
mapping all turns to the shared multi-turn trace. Split files
into collapsible "Files" and "Output Files" sections for both
single-turn and multi-turn views.

Add inline error highlighting on Next click for required fields (name,
evaluation prompt, metric scope, and score-type-conditional fields).
Replace magic score type strings with SCORE_TYPES constants and add
backend model_validator for numeric/categorical conditional validation.

After delete_item() commits, the RLS session variable
(app.current_organization) may no longer be set on the DB connection,
causing lazy-load queries during response serialization to fail with
ProgrammingError. This resulted in 500 errors on delete even though
the deletion itself succeeded.

Add safe_relationship decorator that catches SQLAlchemy errors on
relationship property access and returns safe defaults instead of
propagating the error to the response.
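The shape of such a decorator can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: LazyLoadError is a stand-in for whichever SQLAlchemy exception is caught, and the default_factory parameter and DemoEntity class are hypothetical.

```python
import functools
from typing import Any, Callable, Tuple, Type


class LazyLoadError(Exception):
    """Stand-in for the SQLAlchemy error raised when a lazy load fails."""


def safe_relationship(
    default_factory: Callable[[], Any],
    errors: Tuple[Type[Exception], ...] = (LazyLoadError,),
):
    """Wrap a relationship getter so listed errors yield a safe default."""

    def decorator(getter: Callable[[Any], Any]) -> Callable[[Any], Any]:
        @functools.wraps(getter)
        def wrapper(self: Any) -> Any:
            try:
                return getter(self)
            except errors:
                # Build a fresh default each time to avoid shared mutable state.
                return default_factory()

        return wrapper

    return decorator


class DemoEntity:
    """Hypothetical model whose relationship access may fail after a commit."""

    def __init__(self, fail: bool) -> None:
        self._fail = fail

    @property
    @safe_relationship(default_factory=list)
    def tags(self) -> list:
        if self._fail:
            raise LazyLoadError("session variable no longer set")
        return ["a", "b"]
```

With this wrapper, DemoEntity(fail=True).tags returns an empty list instead of raising, so response serialization can complete after a successful delete.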
Replace raw db.query() in update_test_set_attributes with
crud.get_test_set which applies proper RLS filtering, organization
scoping, and soft-delete exclusion. Gracefully skip updates when the
test set has been soft-deleted instead of raising ValueError.

Pass organization_id and user_id through all call sites.

Update down_revision to chain after the litellm/azure provider
migration introduced on main.

- Replace hardcoded borderRadius values with theme.shape.borderRadius
  in MessageBubble and FileAttachmentList components
- Pass missing sessionToken prop to TestDetailConversationTab in
  TestResultDrawer

@peqy peqy bot left a comment

Critical

  • File format Jinja filters return JSON strings and will double-encode with |tojson, producing wrong request bodies.
  • Backend file execution tests use content_base64 but implementation uses data.

Improvements

  • File upload ordering (position) resets each request.
  • Telemetry span files endpoint lacks explicit auth dependency.

Nit

  • Alembic migration docstring Revises doesn’t match down_revision.

Found 6 issues (2 critical, 2 improvements, 1 nit, 1 question-ish auth concern).

},
}
)
return json.dumps(content)

Critical: these filters return json.dumps(...) strings, but the docs and templating flow expect provider filters to return Python objects and then use |tojson.

With the current implementation, templates like {{ files | to_anthropic | tojson }} will double-encode and TemplateRenderer will json.loads back to a string, so the request body will contain a JSON string instead of an array/object.

Fix: have to_anthropic/to_openai/to_gemini return list[dict] (or dict) directly (no json.dumps), letting |tojson handle JSON serialization.

Contributor Author

Fixed in 60da1ab — filters now return list[dict] instead of json.dumps() strings, so they compose correctly with |tojson.
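The double-encoding problem can be reproduced with the standard library alone. In this sketch json.dumps stands in for Jinja2's |tojson filter, the two filter functions are hypothetical, and the block shape is illustrative rather than any provider's real schema.

```python
import json

files = [{"filename": "test.png", "content_type": "image/png", "data": "aGVsbG8="}]


def to_blocks(items):
    """Illustrative content blocks; not a real provider format."""
    return [{"type": "file", "media_type": f["content_type"], "data": f["data"]} for f in items]


def filter_returning_string(items):
    # Behavior flagged in review: the filter serializes to JSON itself.
    return json.dumps(to_blocks(items))


def filter_returning_objects(items):
    # Behavior after the fix: the filter returns Python objects.
    return to_blocks(items)


# json.dumps plays the role of |tojson in the template.
double_encoded = json.dumps(filter_returning_string(files))
single_encoded = json.dumps(filter_returning_objects(files))

parsed_bad = json.loads(double_encoded)   # a str containing JSON text
parsed_good = json.loads(single_encoded)  # a list of dicts, as the request body needs
```

Parsing the double-encoded value once yields a string rather than a list, which is the wrong request body the review describes.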

{
"filename": "test.png",
"content_type": "image/png",
"content_base64": base64.b64encode(file_content).decode("ascii"),

Critical: test uses content_base64, but the implementation uses data as the base64 key (SingleTurnOutput._load_input_files emits data, and _store_output_files reads file_data.get('data')).

Fix: update tests to use/expect data (or add backward-compat in backend to accept both keys).

Contributor Author

Fixed in 60da1ab — updated all tests to use data key to match the implementation.

{
"filename": "input.png",
"content_type": "image/png",
"content_base64": base64.b64encode(png_bytes).decode("ascii"),

Same issue as test_file_execution.py: this uses content_base64 but the pipeline uses data.

Fix: switch to data everywhere in these tests (and the output_files fixture).

Contributor Author

Fixed in the same commit (60da1ab).

size_bytes=file_size,
content=file_bytes,
entity_id=entity_id,
entity_type=entity_type,

Improvement: position=idx will reset to 0..N for each upload request. If users upload more files later, ordering may collide/interleave unexpectedly.

Fix: set position = existing_count + idx (or max(position)+1) for append semantics, or document that clients must reorder via update.

Contributor Author

Fixed in 60da1ab and refined in 1d50f0b — position now uses append semantics (max(position) + 1 + idx). Moved the query to the CRUD layer with organization_id filtering to follow existing RLS patterns.
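The append rule described here, max(position) + 1 + idx, can be sketched as a pure function. The function name is hypothetical, and starting at 0 when the entity has no files yet is an assumption; the real query lives in the CRUD layer with organization_id filtering.

```python
from typing import Iterable, List


def next_positions(existing: Iterable[int], new_count: int) -> List[int]:
    """Positions for new_count uploads appended after the existing files."""
    current_max = max(existing, default=None)
    # Assumption: an entity with no files yet starts numbering at 0.
    start = 0 if current_max is None else current_max + 1
    return [start + idx for idx in range(new_count)]


first_upload = next_positions([], 3)
second_upload = next_positions(first_upload, 2)
```

Using the maximum rather than the count keeps positions collision-free even if earlier files were deleted and left gaps.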

a polymorphic entity_id + entity_type pattern.

Revision ID: b3f7a9c2d1e4
Revises: aef6c47a8faa

Nit: migration docstring says Revises: aef6c47a8faa, but down_revision is a7b8c9d0e1f2.

Fix: update the docstring header to match to avoid confusion during ops/debugging.

Contributor Author

Fixed in 60da1ab — docstring now matches down_revision.

)


@router.get(

Improvement: GET /telemetry/spans/{span_db_id}/files doesn't require current_user/token (unlike most other routers). If telemetry endpoints are expected to be protected, this could expose file metadata cross-tenant depending on RLS configuration.

Fix: add current_user: User = Depends(require_current_user_or_token) (or whatever auth pattern telemetry routes use), or confirm telemetry router is intentionally public/auth'd elsewhere.

Contributor Author

No change needed — auth is enforced transitively via the get_tenant_context dependency, which internally depends on require_current_user_or_token. This is consistent with all other telemetry endpoints in the same router (none have an explicit current_user param).

harry-rhesis and others added 5 commits March 4, 2026 22:24
The user_id field in TestRunBase lacked a default value, making it
required by Pydantic. After the SDK switched to exclude_none=True,
user_id was omitted from requests, causing 422 validation errors.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

…creation

- Replace unavailable SiMistralai and SiOpenrouter icons with MUI
  fallback icons (not exported in installed react-simple-icons version)
- Make prompt_id optional in TestCreate for inline prompt creation
- Add explicit TestCreate return type to buildTestPayload

Auto-format line length, indentation, and arrow function
parentheses across frontend source and test files.

Revert SiMistralai and SiOpenrouter icon fallbacks and update
@icons-pack/react-simple-icons to v13.12.0 which exports them.

Multi-turn tests and inline prompt creation don't require prompt_id
upfront. All existing usages already guard with null checks.

peqy bot commented Mar 4, 2026

Main issues found:

  • Critical: File attachment payload key mismatch (data vs content_base64). Runtime code paths (execution pipeline + templating filters + trace linking + output file storage) use data, but backend tests use content_base64, so tests will fail and any clients using content_base64 will have output files silently dropped. Suggested fix: standardize on data everywhere and/or accept both keys when decoding.

  • Improvement: Alembic migration header (Revises:) doesn’t match down_revision.

  • Improvement: /files/{id}/content sets Content-Disposition using unsanitized filenames; should be escaped/sanitized to avoid header issues.

Found 3 issues (1 critical, 2 improvements).

@peqy peqy bot left a comment

Requesting changes due to a contract mismatch that will break file execution/output capture.

  • Fix data vs content_base64 mismatch (tests + any decoding paths). Ideally standardize on data everywhere; if backwards compatibility is needed, accept both keys when decoding.
  • Update migration header Revises: to match down_revision.
  • Sanitize Content-Disposition filename to avoid header issues.

{
"filename": "test.png",
"content_type": "image/png",
"content_base64": base64.b64encode(file_content).decode("ascii"),

Critical: The execution pipeline uses files[*].data (base64) as the canonical key (see SingleTurnOutput._load_input_files() and templating filters), but these tests use content_base64. As written, this will fail and it also diverges from the API/docs/front-end.

Fix: update test fixtures to use data (or, if we need backwards-compat, accept both data and content_base64 in the backend when decoding).

Contributor Author

Fixed — see reply above.

{
"filename": "input.png",
"content_type": "image/png",
"content_base64": base64.b64encode(png_bytes).decode("ascii"),

Critical: Same issue as test_file_execution.py: these mocked input files use content_base64, but the runtime path expects data.

Fix: rename to data in these tests (and in any endpoint contract examples) or add backwards-compat decoding for content_base64.

Contributor Author

Fixed — see reply above.

if not isinstance(file_data, dict):
continue

content_b64 = file_data.get("data")

Critical: _store_output_files() reads base64 from file_data.get("data"), but the tests (and likely some clients) are using content_base64. This mismatch will silently skip storing output files.

Fix: align contract. Easiest: accept both keys:

content_b64 = file_data.get("data") or file_data.get("content_base64")

(and update tests/docs to use data).

Contributor Author

The implementation correctly uses data as the key. The tests were wrong — fixed in 60da1ab to use data consistently.
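For comparison, the reviewer's backward-compatible alternative, which the thread suggests was not adopted, would look roughly like the sketch below. The function name is hypothetical; only the data and content_base64 keys come from the discussion.

```python
import base64
from typing import Any, Optional


def decode_file_payload(file_data: Any) -> Optional[bytes]:
    """Decode a file payload, accepting the canonical key and the legacy one."""
    if not isinstance(file_data, dict):
        return None
    # Prefer the canonical "data" key; fall back to the legacy "content_base64".
    content_b64 = file_data.get("data") or file_data.get("content_base64")
    if not content_b64:
        return None
    return base64.b64decode(content_b64)


encoded = base64.b64encode(b"hello").decode("ascii")
canonical = decode_file_payload({"filename": "a.txt", "data": encoded})
legacy = decode_file_payload({"filename": "a.txt", "content_base64": encoded})
```

Accepting both keys avoids silently dropping files from older clients, at the cost of keeping two names alive in the contract; the PR chose a single key instead.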

a polymorphic entity_id + entity_type pattern.

Revision ID: b3f7a9c2d1e4
Revises: aef6c47a8faa

Improvement: Migration header says Revises: aef6c47a8faa but down_revision is a7b8c9d0e1f2. That inconsistency is confusing when debugging migrations.

Fix: update the docstring Revises: line to match down_revision (or vice versa if the chain is wrong).

Contributor Author

Fixed — see reply above.

BytesIO(db_file.content),
media_type=db_file.content_type,
headers={
"Content-Disposition": f'attachment; filename="{db_file.filename}"',

Improvement: Content-Disposition is built from the raw filename. If filenames can contain quotes/newlines, this can break headers or enable response splitting.

Fix: sanitize/quote per RFC 5987/6266 (e.g. filename*=UTF-8''...) or at least strip CR/LF and quotes before interpolation.

Contributor Author

Good point — will address filename sanitization in a follow-up.
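One possible shape for that follow-up, following the reviewer's RFC 5987/6266 suggestion, is sketched here. The function name and the "download" fallback are assumptions; this is not code from the PR.

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build an attachment header value that is safe against header injection."""
    # Drop CR/LF and other non-printable characters.
    cleaned = "".join(ch for ch in filename if ch.isprintable()) or "download"
    # ASCII fallback for older clients: no quotes, backslashes, or non-ASCII.
    fallback = cleaned.encode("ascii", "replace").decode("ascii")
    fallback = fallback.replace("\\", "_").replace('"', "_")
    # Extended parameter: percent-encoded UTF-8 form of the cleaned name.
    encoded = quote(cleaned, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


header = content_disposition('re"port\r\nX-Evil: 1.pdf')
unicode_header = content_disposition("résumé.pdf")
```

Sending both filename and filename* lets modern clients use the exact UTF-8 name while legacy ones fall back to the ASCII approximation.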

harry-rhesis and others added 8 commits March 4, 2026 22:47
- Return Python objects from Jinja file filters instead of JSON
  strings to prevent double-encoding with |tojson
- Fix test key mismatch: use 'data' instead of 'content_base64' to
  match the actual implementation in output_providers and results
- Use append semantics for file upload position (existing max + idx)
  to avoid ordering collisions across multiple upload requests
- Fix migration docstring Revises to match down_revision

Guard against undefined prompt_id in TestDetailData and UpdateTest
after making it optional in TestBase.

For new users, polyphemus_access is explicitly null in the JSONB
column, so dict.get() returns None instead of the default {}.
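The pitfall is that dict.get() only falls back to its default when the key is missing, not when the key is present with a None value. A short sketch, with a hypothetical helper name and settings dict:

```python
from typing import Any, Dict


def get_access(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return the access mapping, treating an explicit null as empty."""
    # "or {}" covers both a missing key and a stored None.
    return settings.get("polyphemus_access") or {}


new_user = {"polyphemus_access": None}  # explicit null from the JSONB column
missing = {}                            # key absent entirely

value_with_default = new_user.get("polyphemus_access", {})  # still None
```

Here value_with_default is None even though a default was supplied, whereas get_access returns an empty dict for both inputs.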
Follow existing RLS patterns by adding get_entity_files_max_position
to the CRUD layer with organization_id filtering, replacing the
inline query in the file router.

The websocket size limit tests were written for a 64KB limit but the
router uses 10MB. Messages under 10MB passed the check, handle_message
returned nothing, and receive_json() blocked forever, hanging the CI
pipeline.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

MetricDataFactory was generating categorical metrics without categories
and numeric metrics without min_score/max_score/threshold, causing 422
validation errors. Also fixed TopicDataFactory long_name edge case
generating names shorter than the 100-char test assertion.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

When endpoint responses contain JSON objects, the Markdown component
crashes because it expects a string. Coerce non-string content to a
fenced JSON code block for proper rendering.

- Add SDK docs for the File entity (upload, download, delete)
- Document metadata as a data source for custom metric evaluation
- Update endpoint docs to reflect that metadata is available to metrics
- Add metadata evaluation example to SDK single-turn metrics docs

Co-Authored-By: Claude Opus 4.6 <[email protected]>
harry-rhesis merged commit 4868712 into main Mar 4, 2026
18 checks passed
harry-rhesis deleted the feat/file-testing branch March 4, 2026 23:10