KIL-257: Fix tags not applied to synth data in subsequent generations by sfierro · Pull Request #862 · Kiln-AI/Kiln

sfierro · 2025-11-27T10:23:00Z

Summary

Fixes a bug where tags (split tags) were not being applied to synthetic data runs when generating more inputs/outputs after an initial save.

Problem

When using the synthetic data generation flow:

Generate inputs/outputs, then Save all → tags applied correctly ✓
Go back to Generate Inputs, generate more, then Generate Outputs, then Save all → tags NOT applied ✗

Root Cause

The get_random_split_tag() function was reading splits from guidance_data.splits, which is a separate store from saved_state.splits. These two stores are supposed to be synced via reactive statements, but the sync logic uses object reference comparison (!==) which fails when both stores reference the same object - causing the sync condition to always be false.

Fix

Updated get_random_split_tag() to read splits directly from saved_state.splits (the persisted source of truth) instead of guidance_data.splits. This aligns with how the QnA page handles splits.

Testing

All existing tests pass
Linting and type checking pass

Summary by CodeRabbit

Bug Fixes
- Fixed random split selection to use persisted state instead of temporary storage, ensuring consistent behavior across sessions.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

The get_random_split_tag() function was reading from guidance_data.splits, which can get out of sync with saved_state.splits due to the reactive sync logic using object reference comparison. This caused split tags to not be applied when generating more outputs after an initial save. Fixed by reading splits directly from saved_state (the persisted source of truth) instead of guidance_data.splits, aligning with how the QnA page handles splits.

coderabbitai · 2025-11-27T10:23:14Z

Walkthrough

The pull request modifies the synth generation page to read random split selections from saved_state.splits instead of guidance_data.splits, ensuring split tags for sample generation are sourced from the persisted indexedDB state rather than in-memory guidance data.

Changes

Cohort / File(s)	Change Summary
Split state sourcing `app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte`	Updated random-split selection logic to read from persisted saved_state.splits instead of in-memory guidance_data.splits

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5–10 minutes

Verify that saved_state.splits is properly initialized and available in the component context
Confirm that switching from guidance_data.splits to saved_state.splits does not introduce stale-state issues
Check that the split tag selection behavior remains consistent with the original logic

Possibly related PRs

Data gen: store state, better UI #426: Introduces saved_state as a reactive store-backed field for splits, providing the foundation for this PR's data sourcing change
fix: synth data tags not deserialized correctly #743: Handles splits URL encoding/decoding for synth/generate routes, complementary to split state management
Synth data goal #422: Centralizes split tag sourcing for sample generation; this PR shifts the source from guidance_data to persisted state

Poem

🐰 From fleeting memory's fickle hold,
To IndexedDB's vault so bold,
The splits now persist, forever saved,
No more lost when browser waves,
Stability blooms in persistence' fold! 🌿

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The description comprehensively covers the problem, root cause, and fix. However, the CLA checkbox and test checklist sections from the template are not included, leaving the description incomplete against the required template.	Add the missing Contributor License Agreement confirmation and test checklist sections to fully comply with the repository's PR description template.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the bug being fixed (tags not applied to synth data in subsequent generations) and references the ticket number, directly matching the main change in the PR.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch sfierro/KIL-257

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2025-11-27T10:25:12Z

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

No lines with coverage information in this diff.

📊 HTML Coverage Report - Interactive coverage report
📈 Diff Coverage Report - Detailed diff analysis
Github Actions Run - View the full coverage report

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1)
79-93: Consider deep comparison or restructuring the reactive sync logic.

The reactive statements use object reference comparison (!==), which fails to detect changes when both stores reference the same object—this was the root cause of the bug fixed in this PR. While the current fix on line 676 works around this by reading directly from saved_state, the sync logic remains fragile for future modifications.

Consider one of these approaches:

Use deep comparison (e.g., JSON.stringify) instead of reference comparison

Restructure to have a single source of truth without two-way sync

Use immutable updates when modifying splits to ensure new object references
-  $: if (is_setup && $saved_state.splits !== $splits) {
+  $: if (is_setup && JSON.stringify($saved_state.splits) !== JSON.stringify($splits)) {
     saved_state.update((s) => ({
       ...s,
       splits: $splits,
     }))
   }
   $: if (
     $saved_state.splits &&
     Object.keys($saved_state.splits).length > 0 &&
-    $saved_state.splits !== $splits
+    JSON.stringify($saved_state.splits) !== JSON.stringify($splits)
   ) {
     guidance_data.splits.set($saved_state.splits)
   }

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a93fec1 and 5b92935.

📒 Files selected for processing (1)

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1 hunks)

🧰 Additional context used

📓 Path-based instructions (4)

app/web_ui/**/*.svelte

📄 CodeRabbit inference engine (.cursor/rules/project.mdc)

app/web_ui/**/*.svelte: Use Svelte v4 (not v5) for the web frontend UI
Use DaisyUI for UI components in the web frontend

Use Svelte v4 (not v5) for the web UI framework

Files:

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte

app/web_ui/**/*.{svelte,ts}

📄 CodeRabbit inference engine (.cursor/rules/project.mdc)

Use Tailwind CSS for styling in the web frontend

Files:

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte

app/web_ui/**/*.{svelte,css,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use Tailwind and DaisyUI for styling the frontend

Files:

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte

app/web_ui/**/*.{svelte,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Make FastAPI backend calls from the Svelte web app to communicate with backend servers

Files:

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte

🧠 Learnings (3)

📚 Learning: 2025-09-25T07:20:11.459Z

Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 650
File: app/web_ui/src/routes/(app)/docs/library/[project_id]/[document_id]/+page.svelte:19-19
Timestamp: 2025-09-25T07:20:11.459Z
Learning: When analyzing relative import paths in SvelteKit route structures, be more careful to verify the actual directory structure exists before suggesting corrections. The import path ../../../../generate/[project_id]/[task_id]/table_button.svelte from app/web_ui/src/routes/(app)/docs/library/[project_id]/[document_id]/+page.svelte correctly resolves to the existing file at app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/table_button.svelte.

Applied to files:

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte

📚 Learning: 2025-10-24T05:01:15.465Z

Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 736
File: app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/qna/qna_ui_store.ts:552-585
Timestamp: 2025-10-24T05:01:15.465Z
Learning: In the Q&A generation workflow (app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/qna/qna_ui_store.ts), chunking is intentionally only ensured for target="all" generation. When users generate Q&A for a specific document or part, they do not alter the chunking strategy; the UI only surfaces document/part-level generation actions after the user has already run the global chunking flow (or chosen to use full documents without chunking). This is by design to separate the chunking decision from per-item generation.
<!-- [/add_learning]

Applied to files:

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte

📚 Learning: 2025-05-09T17:33:29.787Z

Learnt from: scosman
Repo: Kiln-AI/Kiln PR: 296
File: libs/core/kiln_ai/utils/dataset_import.py:116-152
Timestamp: 2025-05-09T17:33:29.787Z
Learning: The tag_splits implementation in dataset_import.py uses deterministic allocation via integer division with remaining items assigned to the largest split, which is preferable over probability-based approaches to guarantee no splits are starved, especially with small datasets.

Applied to files:

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: Generate Coverage Report
GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
GitHub Check: Build Desktop Apps (macos-15-intel)
GitHub Check: Build Desktop Apps (windows-latest)
GitHub Check: Build Desktop Apps (ubuntu-22.04)
GitHub Check: Build Desktop Apps (macos-latest)

🔇 Additional comments (1)

app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1)

674-676: LGTM! Fix correctly addresses the split tag synchronization bug.

Reading splits directly from saved_state.splits (the persisted source of truth) ensures that split tags are consistently applied to generated samples, even in subsequent generations. This sidesteps the reactive sync issues between guidance_data.splits and saved_state.splits.

sfierro · 2025-11-27T10:33:53Z

Thanks for the suggestion! The nitpick about using deep comparison for the reactive sync logic is a valid improvement, but I'll keep this PR focused on the minimal fix for the reported bug. The fragile sync logic could be addressed in a follow-up PR if it causes issues in other contexts.

coderabbitai bot reviewed Nov 27, 2025

View reviewed changes

sfierro closed this Nov 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KIL-257: Fix tags not applied to synth data in subsequent generations#862

KIL-257: Fix tags not applied to synth data in subsequent generations#862
sfierro wants to merge 1 commit intomainfrom
sfierro/KIL-257

sfierro commented Nov 27, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 27, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Nov 27, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

sfierro commented Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sfierro commented Nov 27, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause

Fix

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks and finishing touches

Uh oh!

github-actions bot commented Nov 27, 2025

📊 Coverage Report

Diff: origin/main...HEAD

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

sfierro commented Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sfierro commented Nov 27, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 27, 2025 •

edited

Loading