Open
Description
Required Pre-requisites
- I have read the Documentation
- I have searched the Issue Tracker and Discussions and confirmed that this hasn't been reported yet.
- Consider asking in Discussions first
Motivation
Create an evaluation framework that generates synthetic data at three distinct difficulty levels, so the user intent search pipeline can be tested across a range of real-world scenarios.
Proposed Solution
Implement a three-tier synthetic data generation system with progressive difficulty levels:
Core Implementation:
- Three Prompt Templates (prompt_easy, prompt_medium, prompt_hard)
  - Easy: Direct app mentions + clear goals
  - Medium: App mentions + business context, no explicit function terms
  - Hard: Implicit/contextual language, optional app mentions
- Graduated Complexity Testing
  - Baseline performance (easy) → Real-world scenarios (medium) → Edge cases (hard)
  - Systematic evaluation across user communication styles
- Enhanced Evaluation Pipeline
  - Custom dataset naming by difficulty level
  - Comparative performance analysis
  - Comprehensive metrics tracking per tier
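To make the proposal concrete, here is a minimal Python sketch of how the three tiers could fit together. Only the names prompt_easy, prompt_medium, and prompt_hard come from this issue. The template wording, the helper functions (build_prompt, dataset_name, hit_rate, compare_tiers), the dataset naming scheme, and the hit-rate metric are all assumptions for illustration, not the project's existing API.

```python
# Hypothetical template texts: the issue names the three templates
# but does not specify their wording.
prompt_easy = (
    "Write a user request that names the app {app} "
    "and states a clear goal: {goal}."
)
prompt_medium = (
    "Write a user request that names the app {app} and describes the "
    "business context behind: {goal}. Avoid explicit function terms."
)
prompt_hard = (
    "Write a user request that only implies the need for: {goal}. "
    "Use contextual language; mentioning an app is optional."
)

# Ordered from baseline (easy) to edge cases (hard).
PROMPTS = {"easy": prompt_easy, "medium": prompt_medium, "hard": prompt_hard}


def build_prompt(difficulty: str, app: str, goal: str) -> str:
    """Fill in the template for the given difficulty tier."""
    if difficulty not in PROMPTS:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    # str.format ignores unused keyword arguments, so the hard
    # template may omit {app} without raising.
    return PROMPTS[difficulty].format(app=app, goal=goal)


def dataset_name(base: str, difficulty: str) -> str:
    """Assumed naming scheme: one dataset per tier, suffixed by difficulty."""
    return f"{base}_{difficulty}"


def hit_rate(results: list[bool]) -> float:
    """Fraction of queries where the expected result was retrieved."""
    return sum(results) / len(results) if results else 0.0


def compare_tiers(results_by_tier: dict[str, list[bool]]) -> dict[str, float]:
    """Compute the same metric per tier so the tiers can be compared."""
    return {tier: hit_rate(results_by_tier.get(tier, [])) for tier in PROMPTS}
```

Keeping the tier name as the single key for the template, the dataset name, and the metrics makes it straightforward to run the same evaluation once per tier and compare the results side by side.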
Outcome:
A systematic evaluation framework that reports search pipeline performance at each difficulty tier, from simple to realistic to edge-case user requests, enabling targeted improvements and benchmarking.