Skip to content

manifest: rework blob file reference counting #4806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 5, 2025

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Jun 2, 2025

Rework the reference counting of in-use blob files to decouple the reference counts from the TableMetadatas' blob references. With the introduction of blob file rewrites and replacement, the physical blob file and the logical blob file's references have distinct lifetimes. A physical blob file that has been replaced will need to be removed once all Versions that predate the replacement have been unreferenced.

This commit renames the struct previously known as BlobFileMetadata to PhysicalBlobFile, describing metadata particular to a physical file backing a logical blob file. A new BlobFileMetadata struct is introduced that holds a BlobFileID and a pointer to the PhysicalBlobFile.

This commit additionally adapts the physical blob file referencing to occur through a separate B-Tree of BlobFileMetadata structs. Similar to TableMetadata, references to blob files are maintained by copy-on-write B-Tree nodes which themselves are reference counted. A PhysicalBlobFile's reference count is incremented when a new B-Tree node references a containing BlobFileMetadata, and it's decremented when the B-Tree node's reference count falls to zero. This indirection ensures that a mutation to the set of blob files only performs log(n) work. Since addition and removal of blob files within a version is now modeled directly (as opposed to indirectly via TableMetadata BlobReferences), a blob file may be removed or replaced within the set before all referencing TableMetadata are removed.

Informs #4802.

@jbowens jbowens requested a review from a team as a code owner June 2, 2025 21:08
@jbowens jbowens requested a review from sumeerbhola June 2, 2025 21:08
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jbowens jbowens force-pushed the blob-refcounts branch 5 times, most recently from 751e1d0 to ae59698 Compare June 3, 2025 20:53
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made an initial pass.

Reviewed 2 of 2 files at r2.
Reviewable status: 2 of 18 files reviewed, 14 unresolved discussions


internal/manifest/version_edit.go line 166 at r2 (raw file):

	// While replaying a MANIFEST, the values are nil. Otherwise the values must
	// not be nil.
	DeletedBlobFiles map[base.BlobFileID]*PhysicalBlobFile

I'm getting confused here on why PhysicalBlobFile is sufficient for NewBlobFiles.

When we create an original blob the PhysicalBlobFile is sufficient since the only additional information provided by the BlobFileMetadata is the FileID which is the same as in the PhysicalBlobFile. But when we rewrite an original or an already rewritten blob file, we need the BlobFileMetadata of the new blob, since we need to preserve that mapping from the original FileID as all the sstables that are not in this version edit use that original FileID.

For deleting a blob file it seems the PhysicalBlobFile is sufficient, since a physical blob file's FileID is never referenced by more than one original blob file FileID. Though perhaps it would be easier to have the full BlobFileMetadata to quickly check there are no dangling references.


internal/manifest/blob_metadata.go line 25 at r2 (raw file):

// A BlobReference describes a sstable's reference to a blob value file.
type BlobReference struct {
	// FileID identifies the referenced blob file.

Consider saying that a BlobReference is immutable. And that this is always the original FileID, i.e., if a value reference gets reused in a compaction it continues to reference the original FileID, even if that FileID has already been rewritten.


internal/manifest/blob_metadata.go line 36 at r2 (raw file):

	// OriginalMetadata is the metadata for the original physical blob file. It
	// is non-nil for blob references contained within active Versions. It is

I am unsure about this "is non-nil" comment, that is inherited from the existing code. I don't see us setting BlobReference.Metadata to nil in TableMetadata.Unref. So isn't it always non-nil?


internal/manifest/blob_metadata.go line 48 at r2 (raw file):

	// TODO(jackson): We should either remove the OriginalMetadata from the
	// BlobReference or use it to infer when a blob file has definitely NOT been
	// replaced and a lookup in Version.BlobFileSet is unnecessary.

I suppose the interpolation we do in BlobReference.EstimatedPhysicalSize can continue to use the original metadata without translation? Is that the only reason this exists? Given our in-person discussion today, I'd lean towards removing this and recovering any performance loss in other ways.


internal/manifest/blob_metadata.go line 79 at r2 (raw file):

// BlobFileMetadata encapsulates a blob file ID used to identify a particular
// blob file, and a reference-counted physical blob file. See the BlobFileSet
// documentation for more details.

Consider saying that different versions can have different BlobFileMetadata structs with the same BlobFileMetadata.FileID, and if so they necessarily refer to different PhysicalBlobFiles.


internal/manifest/blob_metadata.go line 82 at r2 (raw file):

type BlobFileMetadata struct {
	// FileID is the file ID used to identify the blob file within
	// BlobReferences.

The phrase "used to identify the blob file within BlobReferences" could be made more precise. Consider saying these are always original blob file ids, i.e., the same domain as BlobReference.FileID.


internal/manifest/blob_metadata.go line 87 at r2 (raw file):

	//
	// If the blob file has been replaced, Physical.FileNum ≠ FileID.
	Physical *PhysicalBlobFile

Always non-nil, yes?


internal/manifest/blob_metadata.go line 103 at r2 (raw file):

// contains one.
func (m BlobFileMetadata) Ref() {
	if m.Physical != nil {

why should this ever be nil?


internal/manifest/blob_metadata.go line 112 at r2 (raw file):

// the provided obsolete files set.
func (m BlobFileMetadata) Unref(of ObsoleteFilesSet) {
	if m.Physical != nil {

ditto


internal/manifest/blob_metadata.go line 312 at r2 (raw file):

// entries remap a blob file ID to a new blob file. This remapping allows for
// replacement of a physical blob file without replacing the referencing
// TableMetadatas.

Consider saying

  • the key is BlobFileMetadata.FileID.
  • whether it is a remapping or not is hidden inside BlobFileMetadata which follows the same structure for both.

internal/manifest/blob_metadata.go line 341 at r2 (raw file):

	// Version's B-Tree should maintain a reference. So a call to remove()
	// should never result in a file's reference count dropping to zero, and we
	// pass assertNoObsoleteFiles{} to assert such.

This is because this is only happening as part of creating the new Version so we are holding a ref to the current (soon to be previous) Version, yes? A comment like this in BulkVersionEdit.Apply is easier to grasp since the context is already there, but providing a bit more context here since this is a standalone function would be helpful.


internal/manifest/blob_metadata.go line 385 at r2 (raw file):

	// files is a map of blob file IDs to *currentBlobFiles, recording metadata
	// about the blob file's active references in the latest Version.
	files map[base.BlobFileID]*currentBlobFile

Worth saying that the key is currentBlobFile.metadata.FileID, so original blob file ids.
Given how often we are saying this, I wonder if we should make it a different type.


internal/manifest/blob_metadata.go line 391 at r2 (raw file):

type currentBlobFile struct {
	metadata BlobFileMetadata

Is this a copy of the struct in the btree for the latest version?


internal/manifest/table_metadata.go line 232 at r2 (raw file):

	if v := m.refs.Add(1); v == 1 {
		m.TableBacking.Ref()
		for _, blobRef := range m.BlobReferences {

I see why this is removed since the BlobFileSet refs on its behalf as the same TableMetadata can ref different blob files in different versions.
Do we do a invariants.Enabled consistency check somewhere that a version's BlobFileSet has exactly the FileIDs that are referenced in the TableMetadatas in that version?

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 0 of 19 files reviewed, 14 unresolved discussions (waiting on @sumeerbhola)


internal/manifest/blob_metadata.go line 25 at r2 (raw file):

Previously, sumeerbhola wrote…

Consider saying that a BlobReference is immutable. And that this is always the original FileID, i.e., if a value reference gets reused in a compaction it continues to reference the original FileID, even if that FileID has already been rewritten.

Done.


internal/manifest/blob_metadata.go line 36 at r2 (raw file):

Previously, sumeerbhola wrote…

I am unsure about this "is non-nil" comment, that is inherited from the existing code. I don't see us setting BlobReference.Metadata to nil in TableMetadata.Unref. So isn't it always non-nil?

This is saying it's non-nil in versions. In the next sentence we describe when it may be nil: decoding a version edit during manifest replay.


internal/manifest/blob_metadata.go line 48 at r2 (raw file):

Previously, sumeerbhola wrote…

I suppose the interpolation we do in BlobReference.EstimatedPhysicalSize can continue to use the original metadata without translation? Is that the only reason this exists? Given our in-person discussion today, I'd lean towards removing this and recovering any performance loss in other ways.

Yeah, we can lift the EstimatedPhysicalSize up onto the BlobReference. I'll follow up with that (and the logic to actually perform the look up in the BTreeSet) in a subsequent PR


internal/manifest/blob_metadata.go line 79 at r2 (raw file):

Previously, sumeerbhola wrote…

Consider saying that different versions can have different BlobFileMetadata structs with the same BlobFileMetadata.FileID, and if so they necessarily refer to different PhysicalBlobFiles.

Done.


internal/manifest/blob_metadata.go line 82 at r2 (raw file):

Previously, sumeerbhola wrote…

The phrase "used to identify the blob file within BlobReferences" could be made more precise. Consider saying these are always original blob file ids, i.e., the same domain as BlobReference.FileID.

Done.


internal/manifest/blob_metadata.go line 87 at r2 (raw file):

Previously, sumeerbhola wrote…

Always non-nil, yes?

Yeah, added a sentence.


internal/manifest/blob_metadata.go line 103 at r2 (raw file):

Previously, sumeerbhola wrote…

why should this ever be nil?

Removed, this was detritus from an earlier version of the code


internal/manifest/blob_metadata.go line 112 at r2 (raw file):

Previously, sumeerbhola wrote…

ditto

Done.


internal/manifest/blob_metadata.go line 312 at r2 (raw file):

Previously, sumeerbhola wrote…

Consider saying

  • the key is BlobFileMetadata.FileID.
  • whether it is a remapping or not is hidden inside BlobFileMetadata which follows the same structure for both.

Done.

The language here was stale from a previous iteration that structured remappings differently


internal/manifest/blob_metadata.go line 341 at r2 (raw file):

Previously, sumeerbhola wrote…

This is because this is only happening as part of creating the new Version so we are holding a ref to the current (soon to be previous) Version, yes? A comment like this in BulkVersionEdit.Apply is easier to grasp since the context is already there, but providing a bit more context here since this is a standalone function would be helpful.

Done.


internal/manifest/blob_metadata.go line 385 at r2 (raw file):

Previously, sumeerbhola wrote…

Worth saying that the key is currentBlobFile.metadata.FileID, so original blob file ids.
Given how often we are saying this, I wonder if we should make it a different type.

the BlobFileID type is only ever used as the stable, original blob file id. When we're referring to a physical blob file, we always use base.DiskFileNum


internal/manifest/blob_metadata.go line 391 at r2 (raw file):

Previously, sumeerbhola wrote…

Is this a copy of the struct in the btree for the latest version?

Yeah, added a comment.


internal/manifest/table_metadata.go line 232 at r2 (raw file):

Previously, sumeerbhola wrote…

I see why this is removed since the BlobFileSet refs on its behalf as the same TableMetadata can ref different blob files in different versions.
Do we do a invariants.Enabled consistency check somewhere that a version's BlobFileSet has exactly the FileIDs that are referenced in the TableMetadatas in that version?

We didn't--I added one in a new commit. I needed to genericize some more of the B-Tree impl to allow the iteration over the BlobFileSet to do it.


internal/manifest/version_edit.go line 166 at r2 (raw file):

Previously, sumeerbhola wrote…

I'm getting confused here on why PhysicalBlobFile is sufficient for NewBlobFiles.

When we create an original blob the PhysicalBlobFile is sufficient since the only additional information provided by the BlobFileMetadata is the FileID which is the same as in the PhysicalBlobFile. But when we rewrite an original or an already rewritten blob file, we need the BlobFileMetadata of the new blob, since we need to preserve that mapping from the original FileID as all the sstables that are not in this version edit use that original FileID.

For deleting a blob file it seems the PhysicalBlobFile is sufficient, since a physical blob file's FileID is never referenced by more than one original blob file FileID. Though perhaps it would be easier to have the full BlobFileMetadata to quickly check there are no dangling references.

Yeah, this commit is constrained to just reworking the reference counting. I have a wip commit that will disentangle the BlobFileID and DiskFileNum in the tagNewBlobFile version edit record.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 6 files at r5, 15 of 18 files at r6, 3 of 3 files at r7, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @jbowens)

jbowens added 4 commits June 5, 2025 17:20
Genericize the B-Tree iterator so it may be used with the upcoming blob file
B-Tree.
Add an All method that returns a Go iterator over the B-Tree's items.
Rework the reference counting of in-use blob files to decouple the reference
counts from the TableMetadatas' blob references. With the introduction of blob
file rewrites and replacement, the physical blob file and the logical blob
file's references have distinct lifetimes. A physical blob file that has been
replaced will need to be removed once all Versions that predate the replacement
have been unreferenced.

This commit renames the struct previously known as BlobFileMetadata to
PhysicalBlobFile, describing metadata particular to a physical file backing a
logical blob file. A new BlobFileMetadata struct is introduced that holds a
BlobFileID and a pointer to the PhysicalBlobFile.

This commit additionally adapts the physical blob file referencing to occur
through a separate B-Tree of BlobFileMetadata structs. Similar to
TableMetadata, references to blob files are maintained by copy-on-write B-Tree
nodes which themselves are reference counted. A PhysicalBlobFile's reference
count is incremented when a new B-Tree node references a containing
BlobFileMetadata, and it's decremented when the B-Tree node's reference count
falls to zero. This indirection ensures that a mutation to the set of blob
files only performs log(n) work. Since addition and removal of blob files
within a version is now modeled directly (as opposed to indirectly via
TableMetadata BlobReferences), a blob file may be removed or replaced within
the set before all referencing TableMetadata are removed.

Informs cockroachdb#4802.
Sometimes in invariants builds, validate that the set of unique BlobFileIDs
contained in TableMetadata BlobReferences is exactly the set of BlobFileIDs in
the Version's BlobFileSet.
@jbowens
Copy link
Collaborator Author

jbowens commented Jun 5, 2025

TFTR!

@jbowens jbowens merged commit b10c0e8 into cockroachdb:master Jun 5, 2025
6 checks passed
@jbowens jbowens deleted the blob-refcounts branch June 5, 2025 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants