Skip to content

[ENH]: Plumb prefix path all the way to the bf writer #4743

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: 05-30-_enh_return_database_id_in_get_collections_call_from_sysdb
Choose a base branch
from

Conversation

sanketkedia
Copy link
Contributor

@sanketkedia sanketkedia commented Jun 3, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Plumbs the tenant and database id from compactor orchestrator to the segment writer.
    • The segment writer constructs the prefix path as per format: tenant/{tenant}/database/{database_id}/collection/{collection_id}/segment/{segment_id}
    • It then passes this prefix in the construction of various blockfile writers as part of the write options
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

None

Copy link
Contributor Author

sanketkedia commented Jun 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@sanketkedia sanketkedia marked this pull request as ready for review June 3, 2025 23:27
Copy link
Contributor

propel-code-bot bot commented Jun 3, 2025

Add Explicit Blockfile Prefix Path Plumbed from Tenant/Database to Segment Writers

This PR refactors the blockfile writer creation path throughout the Chroma codebase, ensuring an explicit and consistent prefix path is constructed and passed to all blockfile writers. The prefix is now composed with tenant, database ID, collection ID, and segment ID, which enables clearer S3/object store layout by partitioning files by tenant/db. This change updates constructors, signatures, and call chains for segment writers, compactor orchestrators, blockfile providers, and all associated tests, enforcing the prefix as required context.

Key Changes:
• All blockfile writer instantiations now require and use an explicit prefix_path, composed as tenant/{tenant}/database/{database_id}/collection/{collection_id}/segment/{segment_id}
• Segment writers (metadata, record, vector) and orchestrator logic updated to plumb tenant and database_id all the way down to blockfile writer creation
• BlockfileWriterOptions no longer has a default constructor; callers must provide a prefix_path string for every writer
• Tests, benchmarks, and concurrency checks updated to comply with new constructor and required prefix path
• DatabaseUuid added as a distinct type; test helpers populated database_id in test segments
• Best practice for data partitioning by multi-tenancy implemented for object store hygiene and scalability

Affected Areas:
• rust/segment (metadata, record, and test)
• rust/blockstore (writer options, blockfile, ordered/unordered writers, benchmarks, tests, concurrency)
• rust/worker (orchestration, filter, count records, prefetch segment operators)
• rust/types (Collection/DatabaseUuid)
• rust/index (metadata/fulltext/spann types)

Potential Impact:

Functionality: No change to the logical behavior of segment writers; improved ability to partition/locate blockfiles by tenant/db/collection/segment; potential for cleaner deletes and better S3/object store organization.

Performance: No direct impact; minor overhead in constructing and passing prefixes, possible improvement in cleanup and batch operations.

Security: More explicit separation of tenant and database storage, reducing accidental sharing or key collisions.

Scalability: Enables better scaling for multi-tenant systems by partitioning files and isolating data blobs; prepares base for sharding and cleaner data management.

Review Focus:
• Look for places where blockfile writers are constructed; confirm that they always receive the intended prefix path
• Check for signature changes or call chain alignment issues, particularly when plumbing tenant/database_id
• Ensure test fixtures and helper initializers are adapted throughout
• Verify possible downstream consumers of BlockfileWriterOptions are compatible

Testing Needed

• All segment writers, orchestrators, and test suites should pass with the new required prefix_path
• Verify spawn and fork logic for blockfile writers in multithreaded/concurrent settings
• Object store/S3 layouts for written blockfiles should reflect the deep prefix structure
• Check new unique DatabaseUuid logic in affected tests

Code Quality Assessment

rust/blockstore/src/types/writer_options.rs: Enforces explicit and non-default construction; ensures all call sites must opt-in and acknowledge use of prefix_path.

rust/segment/src/blockfile_{record,metadata}.rs: Updated methods and struct signatures to require tenant/database_id; only forwards context, no logic change.

rust/types/src/collection.rs: Introduces DatabaseUuid as a dedicated typed ID; follows best practice for identifier separation.

tests/benchmarks: Fixture and synthetic data providers now build test segments with new prefix and database id.

Best Practices

Multi Tenancy:
• Strong separation of tenant/database paths in physical object store
• No implicit global defaults anywhere in writer instantiation

Constructor Design:
• Elimination of default constructors that could hide context requirements
• Explicit forced context propagation through plumbing

Potential Issues

• Callers not updated to provide prefix_path will fail at compile time (intentional, expected enforcement).
• If prefixes are misconstructed (wrong tenant/database), downstream storage hygiene could be broken.
• Edge-case tests for legacy/default behaviors will need explicit updates.
• Compatibility with legacy blockfiles written with no prefix may not be directly broken, but new writers will not use them unless explicitly pointed.

This summary was automatically generated by @propel-code-bot

}

impl BlockfileWriterOptions {
pub fn new() -> Self {
Self::default()
pub fn new(prefix_path: String) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is intentional to force all callers to necessarily think about and set a prefix path. There is no default constructor now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant