Skip to content

KIL-257: Fix tags not applied to synth data in subsequent generations#862

Closed
sfierro wants to merge 1 commit intomainfrom
sfierro/KIL-257
Closed

KIL-257: Fix tags not applied to synth data in subsequent generations#862
sfierro wants to merge 1 commit intomainfrom
sfierro/KIL-257

Conversation

@sfierro
Copy link
Contributor

@sfierro sfierro commented Nov 27, 2025

Summary

Fixes a bug where tags (split tags) were not being applied to synthetic data runs when generating more inputs/outputs after an initial save.

Problem

When using the synthetic data generation flow:

  1. Generate inputs/outputs, then Save all → tags applied correctly ✓
  2. Go back to Generate Inputs, generate more, then Generate Outputs, then Save all → tags NOT applied ✗

Root Cause

The get_random_split_tag() function was reading splits from guidance_data.splits, which is a separate store from saved_state.splits. These two stores are supposed to be synced via reactive statements, but the sync logic uses object reference comparison (!==) which fails when both stores reference the same object - causing the sync condition to always be false.

Fix

Updated get_random_split_tag() to read splits directly from saved_state.splits (the persisted source of truth) instead of guidance_data.splits. This aligns with how the QnA page handles splits.

Testing

  • All existing tests pass
  • Linting and type checking pass

Summary by CodeRabbit

  • Bug Fixes
    • Fixed random split selection to use persisted state instead of temporary storage, ensuring consistent behavior across sessions.

✏️ Tip: You can customize this high-level summary in your review settings.

The get_random_split_tag() function was reading from guidance_data.splits,
which can get out of sync with saved_state.splits due to the reactive sync
logic using object reference comparison. This caused split tags to not be
applied when generating more outputs after an initial save.

Fixed by reading splits directly from saved_state (the persisted source of
truth) instead of guidance_data.splits, aligning with how the QnA page
handles splits.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 27, 2025

Walkthrough

The pull request modifies the synth generation page to read random split selections from saved_state.splits instead of guidance_data.splits, ensuring split tags for sample generation are sourced from the persisted indexedDB state rather than in-memory guidance data.

Changes

Cohort / File(s) Change Summary
Split state sourcing
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
Updated random-split selection logic to read from persisted saved_state.splits instead of in-memory guidance_data.splits

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5–10 minutes

  • Verify that saved_state.splits is properly initialized and available in the component context
  • Confirm that switching from guidance_data.splits to saved_state.splits does not introduce stale-state issues
  • Check that the split tag selection behavior remains consistent with the original logic

Possibly related PRs

Poem

🐰 From fleeting memory's fickle hold,
To IndexedDB's vault so bold,
The splits now persist, forever saved,
No more lost when browser waves,
Stability blooms in persistence' fold! 🌿

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive The description comprehensively covers the problem, root cause, and fix. However, the CLA checkbox and test checklist sections from the template are not included, leaving the description incomplete against the required template. Add the missing Contributor License Agreement confirmation and test checklist sections to fully comply with the repository's PR description template.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the bug being fixed (tags not applied to synth data in subsequent generations) and references the ticket number, directly matching the main change in the PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sfierro/KIL-257

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

No lines with coverage information in this diff.


Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1)

79-93: Consider deep comparison or restructuring the reactive sync logic.

The reactive statements use object reference comparison (!==), which fails to detect changes when both stores reference the same object—this was the root cause of the bug fixed in this PR. While the current fix on line 676 works around this by reading directly from saved_state, the sync logic remains fragile for future modifications.

Consider one of these approaches:

  • Use deep comparison (e.g., JSON.stringify) instead of reference comparison
  • Restructure to have a single source of truth without two-way sync
  • Use immutable updates when modifying splits to ensure new object references
-  $: if (is_setup && $saved_state.splits !== $splits) {
+  $: if (is_setup && JSON.stringify($saved_state.splits) !== JSON.stringify($splits)) {
     saved_state.update((s) => ({
       ...s,
       splits: $splits,
     }))
   }
   $: if (
     $saved_state.splits &&
     Object.keys($saved_state.splits).length > 0 &&
-    $saved_state.splits !== $splits
+    JSON.stringify($saved_state.splits) !== JSON.stringify($splits)
   ) {
     guidance_data.splits.set($saved_state.splits)
   }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a93fec1 and 5b92935.

📒 Files selected for processing (1)
  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
app/web_ui/**/*.svelte

📄 CodeRabbit inference engine (.cursor/rules/project.mdc)

app/web_ui/**/*.svelte: Use Svelte v4 (not v5) for the web frontend UI
Use DaisyUI for UI components in the web frontend

Use Svelte v4 (not v5) for the web UI framework

Files:

  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
app/web_ui/**/*.{svelte,ts}

📄 CodeRabbit inference engine (.cursor/rules/project.mdc)

Use Tailwind CSS for styling in the web frontend

Files:

  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
app/web_ui/**/*.{svelte,css,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use Tailwind and DaisyUI for styling the frontend

Files:

  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
app/web_ui/**/*.{svelte,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Make FastAPI backend calls from the Svelte web app to communicate with backend servers

Files:

  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
🧠 Learnings (3)
📚 Learning: 2025-09-25T07:20:11.459Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 650
File: app/web_ui/src/routes/(app)/docs/library/[project_id]/[document_id]/+page.svelte:19-19
Timestamp: 2025-09-25T07:20:11.459Z
Learning: When analyzing relative import paths in SvelteKit route structures, be more careful to verify the actual directory structure exists before suggesting corrections. The import path ../../../../generate/[project_id]/[task_id]/table_button.svelte from app/web_ui/src/routes/(app)/docs/library/[project_id]/[document_id]/+page.svelte correctly resolves to the existing file at app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/table_button.svelte.

Applied to files:

  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
📚 Learning: 2025-10-24T05:01:15.465Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 736
File: app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/qna/qna_ui_store.ts:552-585
Timestamp: 2025-10-24T05:01:15.465Z
Learning: In the Q&A generation workflow (app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/qna/qna_ui_store.ts), chunking is intentionally only ensured for target="all" generation. When users generate Q&A for a specific document or part, they do not alter the chunking strategy; the UI only surfaces document/part-level generation actions after the user has already run the global chunking flow (or chosen to use full documents without chunking). This is by design to separate the chunking decision from per-item generation.
<!-- [/add_learning]

Applied to files:

  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
📚 Learning: 2025-05-09T17:33:29.787Z
Learnt from: scosman
Repo: Kiln-AI/Kiln PR: 296
File: libs/core/kiln_ai/utils/dataset_import.py:116-152
Timestamp: 2025-05-09T17:33:29.787Z
Learning: The tag_splits implementation in dataset_import.py uses deterministic allocation via integer division with remaining items assigned to the largest split, which is preferable over probability-based approaches to guarantee no splits are starved, especially with small datasets.

Applied to files:

  • app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Generate Coverage Report
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (macos-15-intel)
  • GitHub Check: Build Desktop Apps (windows-latest)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)
  • GitHub Check: Build Desktop Apps (macos-latest)
🔇 Additional comments (1)
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1)

674-676: LGTM! Fix correctly addresses the split tag synchronization bug.

Reading splits directly from saved_state.splits (the persisted source of truth) ensures that split tags are consistently applied to generated samples, even in subsequent generations. This sidesteps the reactive sync issues between guidance_data.splits and saved_state.splits.

@sfierro
Copy link
Contributor Author

sfierro commented Nov 27, 2025

Thanks for the suggestion! The nitpick about using deep comparison for the reactive sync logic is a valid improvement, but I'll keep this PR focused on the minimal fix for the reported bug. The fragile sync logic could be addressed in a follow-up PR if it causes issues in other contexts.

@sfierro sfierro closed this Nov 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant