KIL-257: Fix tags not applied to synth data in subsequent generations#862
The get_random_split_tag() function was reading from guidance_data.splits, which can get out of sync with saved_state.splits due to the reactive sync logic using object reference comparison. This caused split tags to not be applied when generating more outputs after an initial save. Fixed by reading splits directly from saved_state (the persisted source of truth) instead of guidance_data.splits, aligning with how the QnA page handles splits.
Walkthrough
The pull request modifies the synth generation page to read random split selections from …
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~5–10 minutes
Pre-merge checks
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
📊 Coverage Report
Overall Coverage: 92%
Diff: origin/main...HEAD
No lines with coverage information in this diff.
Actionable comments posted: 0
🧹 Nitpick comments (1)
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1)
79-93: Consider deep comparison or restructuring the reactive sync logic.
The reactive statements use object reference comparison (!==), which fails to detect changes when both stores reference the same object; this was the root cause of the bug fixed in this PR. While the current fix on line 676 works around this by reading directly from saved_state, the sync logic remains fragile for future modifications.
Consider one of these approaches:
- Use deep comparison (e.g., JSON.stringify) instead of reference comparison
- Restructure to have a single source of truth without two-way sync
- Use immutable updates when modifying splits to ensure new object references
-  $: if (is_setup && $saved_state.splits !== $splits) {
+  $: if (is_setup && JSON.stringify($saved_state.splits) !== JSON.stringify($splits)) {
     saved_state.update((s) => ({
       ...s,
       splits: $splits,
     }))
   }

   $: if (
     $saved_state.splits &&
     Object.keys($saved_state.splits).length > 0 &&
-    $saved_state.splits !== $splits
+    JSON.stringify($saved_state.splits) !== JSON.stringify($splits)
   ) {
     guidance_data.splits.set($saved_state.splits)
   }
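The difference between the comparison strategies listed above can be illustrated outside Svelte with plain objects. This is a minimal sketch, not the page's actual code: the `Splits` type, the helper names, and the sample values are all assumptions made for illustration. It also suggests that when two holders share one object, an in-place mutation looks unchanged to both reference and deep comparison, which is why immutable updates or a single source of truth may be the more robust options.

```typescript
// Hypothetical stand-in for the splits record (tag name -> proportion).
type Splits = Record<string, number>

// Reference comparison, as used by the original reactive statements.
function changedByReference(a: Splits, b: Splits): boolean {
  return a !== b
}

// Deep comparison via JSON.stringify, as suggested in the review.
// Note: the result is sensitive to key insertion order.
function changedByValue(a: Splits, b: Splits): boolean {
  return JSON.stringify(a) !== JSON.stringify(b)
}

// Assumed scenario: after a sync, both holders point at the same object.
const shared: Splits = { train: 0.8, test: 0.2 }
const savedSplits: Splits = shared
const guidanceSplits: Splits = shared

// In-place mutation: there is only one object, so neither comparison
// reports a difference between the two holders.
guidanceSplits["val"] = 0.1
const mutationSeenByReference = changedByReference(savedSplits, guidanceSplits)
const mutationSeenByValue = changedByValue(savedSplits, guidanceSplits)

// Immutable update: a new object is created, so both comparisons detect it.
const updated: Splits = { ...guidanceSplits, holdout: 0.05 }
const updateSeenByReference = changedByReference(savedSplits, updated)
const updateSeenByValue = changedByValue(savedSplits, updated)

// Equal contents in distinct objects: reference comparison reports a
// spurious change, deep comparison does not.
const copy: Splits = { ...savedSplits }
const copySeenByReference = changedByReference(savedSplits, copy)
const copySeenByValue = changedByValue(savedSplits, copy)
```

In this sketch, deep comparison mainly removes spurious updates between equal-but-distinct objects; detecting a real change still requires that the change produce a new object.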
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte(1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
app/web_ui/**/*.svelte
📄 CodeRabbit inference engine (.cursor/rules/project.mdc)
app/web_ui/**/*.svelte: Use Svelte v4 (not v5) for the web frontend UI
Use DaisyUI for UI components in the web frontend
Files:
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
app/web_ui/**/*.{svelte,ts}
📄 CodeRabbit inference engine (.cursor/rules/project.mdc)
Use Tailwind CSS for styling in the web frontend
Files:
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
app/web_ui/**/*.{svelte,css,ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use Tailwind and DaisyUI for styling the frontend
Files:
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
app/web_ui/**/*.{svelte,ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Make FastAPI backend calls from the Svelte web app to communicate with backend servers
Files:
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
🧠 Learnings (3)
📚 Learning: 2025-09-25T07:20:11.459Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 650
File: app/web_ui/src/routes/(app)/docs/library/[project_id]/[document_id]/+page.svelte:19-19
Timestamp: 2025-09-25T07:20:11.459Z
Learning: When analyzing relative import paths in SvelteKit route structures, be more careful to verify the actual directory structure exists before suggesting corrections. The import path ../../../../generate/[project_id]/[task_id]/table_button.svelte from app/web_ui/src/routes/(app)/docs/library/[project_id]/[document_id]/+page.svelte correctly resolves to the existing file at app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/table_button.svelte.
Applied to files:
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
📚 Learning: 2025-10-24T05:01:15.465Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 736
File: app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/qna/qna_ui_store.ts:552-585
Timestamp: 2025-10-24T05:01:15.465Z
Learning: In the Q&A generation workflow (app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/qna/qna_ui_store.ts), chunking is intentionally only ensured for target="all" generation. When users generate Q&A for a specific document or part, they do not alter the chunking strategy; the UI only surfaces document/part-level generation actions after the user has already run the global chunking flow (or chosen to use full documents without chunking). This is by design to separate the chunking decision from per-item generation.
Applied to files:
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
📚 Learning: 2025-05-09T17:33:29.787Z
Learnt from: scosman
Repo: Kiln-AI/Kiln PR: 296
File: libs/core/kiln_ai/utils/dataset_import.py:116-152
Timestamp: 2025-05-09T17:33:29.787Z
Learning: The tag_splits implementation in dataset_import.py uses deterministic allocation via integer division with remaining items assigned to the largest split, which is preferable over probability-based approaches to guarantee no splits are starved, especially with small datasets.
Applied to files:
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Generate Coverage Report
- GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
- GitHub Check: Build Desktop Apps (macos-15-intel)
- GitHub Check: Build Desktop Apps (windows-latest)
- GitHub Check: Build Desktop Apps (ubuntu-22.04)
- GitHub Check: Build Desktop Apps (macos-latest)
🔇 Additional comments (1)
app/web_ui/src/routes/(app)/generate/[project_id]/[task_id]/synth/+page.svelte (1)
674-676: LGTM! Fix correctly addresses the split tag synchronization bug.
Reading splits directly from saved_state.splits (the persisted source of truth) ensures that split tags are consistently applied to generated samples, even in subsequent generations. This sidesteps the reactive sync issues between guidance_data.splits and saved_state.splits.
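The idea of reading splits from the persisted state at the moment a tag is needed can be sketched as follows. The actual body of get_random_split_tag() is not shown in this conversation, so everything here is an assumption: the `Splits` shape (tag name mapped to a probability), the `pickRandomSplitTag` helper, the injectable `rand` parameter, and the `persistedState` object standing in for saved_state.

```typescript
// Hypothetical shape: tag name -> probability (assumed to sum to roughly 1).
type Splits = Record<string, number>

// Picks a tag by walking the cumulative distribution.
// `rand` is injectable so the behavior can be checked deterministically.
function pickRandomSplitTag(
  splits: Splits,
  rand: () => number = Math.random
): string | undefined {
  const r = rand()
  let cumulative = 0
  let lastTag: string | undefined = undefined
  for (const [tag, probability] of Object.entries(splits)) {
    cumulative += probability
    lastTag = tag
    if (r < cumulative) return tag
  }
  // Empty splits yield undefined (no tag). Otherwise fall back to the last
  // tag, which covers rounding when probabilities sum to just under 1.
  return lastTag
}

// Assumed persisted state; in the PR this role is played by saved_state.
const persistedState: { splits: Splits } = {
  splits: { train: 0.8, test: 0.2 },
}

// Read from the persisted state at call time rather than from a synced copy.
const tagLow = pickRandomSplitTag(persistedState.splits, () => 0.1)
const tagHigh = pickRandomSplitTag(persistedState.splits, () => 0.95)
const tagNone = pickRandomSplitTag({}, () => 0.5)
```

Because the helper takes the splits as an argument, the caller decides which store is authoritative, and no second copy has to be kept in sync.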
Thanks for the suggestion! The nitpick about using deep comparison for the reactive sync logic is a valid improvement, but I'll keep this PR focused on the minimal fix for the reported bug. The fragile sync logic could be addressed in a follow-up PR if it causes issues in other contexts.
Summary
Fixes a bug where tags (split tags) were not being applied to synthetic data runs when generating more inputs/outputs after an initial save.
Problem
When using the synthetic data generation flow:
- Save all → tags applied correctly ✓
- Generate Inputs, generate more, then Generate Outputs, then Save all → tags NOT applied ✗
Root Cause
The get_random_split_tag() function was reading splits from guidance_data.splits, which is a separate store from saved_state.splits. These two stores are supposed to be synced via reactive statements, but the sync logic uses object reference comparison (!==), which fails when both stores reference the same object, causing the sync condition to always be false.
Fix
Updated get_random_split_tag() to read splits directly from saved_state.splits (the persisted source of truth) instead of guidance_data.splits. This aligns with how the QnA page handles splits.
Testing