-
Notifications
You must be signed in to change notification settings - Fork 2k
[ENH]: Globalize dead letter queue in SysDB #6003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
e5ea351 to
ba6bd66
Compare
|
Implement Global DLQ persistence via SysDB compaction failure tracking Introduces a globally persistent compaction dead-letter queue by recording per-collection failure counts in SysDB, removing node-local in-memory tracking, and wiring compactor logic to consult the persisted state. Adds the new Key Changes• Extend Affected Areas• This summary was automatically generated by @propel-code-bot |
ba6bd66 to
0eda4dd
Compare
0eda4dd to
1881b77
Compare
| pub async fn increment_compaction_failure_count( | ||
| &self, | ||
| _collection_id: CollectionUuid, | ||
| ) -> Result<(), sqlx::Error> { | ||
| Err(sqlx::Error::Protocol( | ||
| "increment_compaction_failure_count not implemented for SQLite".into(), | ||
| )) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Requirements] This function is currently a stub that returns an error. Since the implementation for SQLite seems straightforward, it would be beneficial to implement it now to ensure consistent behavior across different environments (local/testing vs. production) and to make testing this feature easier. Here's a suggested implementation that performs an atomic update.
Context for Agents
This function is currently a stub that returns an error. Since the implementation for SQLite seems straightforward, it would be beneficial to implement it now to ensure consistent behavior across different environments (local/testing vs. production) and to make testing this feature easier. Here's a suggested implementation that performs an atomic update.
File: rust/sysdb/src/sqlite.rs
Line: 1147
Description of changes
Summarize the changes made by this PR.
compaction_failure_countcolumn on thecollectionstable to implement a global DLQ.failing_jobsmap has been removed.get_dead_jobs()endpoint has been removed as this information should be derived from the SysDB now.Test plan
How are these changes tested?
test_k8s_integration_scheduleralready tests this feature.pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
The new column in
collectionsis defaulted to 0 in the database, Go model and proto definitions.Observability plan
What is the plan to instrument and monitor this change?
The DLQ is more easily observable from the SysDB now.
compactor_job_failure_countreplaces thecompactor_dead_jobs_countmetric and we increment that on every failure instead of every time a job is "killed".Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the _docs section?_