feat: bulkload existing artifacts #8784


Draft
wants to merge 8 commits into base: feat/blobstage

Conversation

rjsparks
Member

@rjsparks rjsparks commented Apr 4, 2025

No description provided.

record.len = int(content.custom_metadata["len"])
record.content_type = content_type
if explicit_mtime is not None:
record.created = explicit_mtime
Member

Do we have circumstances where we'd update modified but not created?

From the timestamp fields' help text, I think we have store_created, which is only ever set the first time a record for (store, name) is created. If, e.g., that record is deleted and a new one is created for (store, name), the old store_created will stay as it was. The record is not new, but it basically represents a different file than it did before the original was deleted.

There's created, which is the time the stored object's current filesystem file came into existence.

There's modified which is the most recent time that file's contents were updated.

For the backfill, I think things are ok - we don't have a real "created" time for the file we're pulling in from the filesystem, so we use its mtime. But suppose later we change the file and save the new copy to the blob store. In that case, I think we want to leave both store_created and created alone but update modified. (If not, then why do we have both created and modified?)

If we do need both, then I think we need to expose them on the MetadataFile so that we can make them equal for backfilling but in general separately set the mtime.
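A minimal sketch of the three-field scheme described above (the field names come from this discussion; the class itself is a hypothetical stand-in, not the actual StoredObject model):

```python
class StoredObjectSketch:
    """Hypothetical stand-in for the three timestamp fields discussed above."""

    def __init__(self, now):
        self.store_created = now  # set only when the (store, name) record first appears
        self.created = now        # when the current file came into existence
        self.modified = now       # last time the file's contents changed

    def update_contents(self, now):
        # ordinary content change: only modified moves
        self.modified = now

obj = StoredObjectSketch(now=1)
obj.update_contents(now=2)
obj.update_contents(now=3)
assert (obj.store_created, obj.created, obj.modified) == (1, 1, 3)
```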

Member Author

Do you mean would we ever override modified but not created (vs taking defaults for both)?
No, I don't think we would.

The only reason we have the ability to override mtime is so that we can tell other users of the store, particularly rsync, what the mtime of the file should be - so that we can load the millions of already existing artifacts into the store.

When we do that, we are setting created and modified to that same overridden value because it's the best we can do.

Consider an artifact that we might update in 2028 (under the current ideas of what's acceptable to change), such as the html presentation of a draft. Assume we've run the bulk import in 2025, and the draft html was originally from 2020.

Before the update, store_created would be in 2025, created and modified would be equal and in 2020.
After the update, store_created would be in 2025, created would be in 2020, and modified would be in 2028. That's in the StoredObject table. The blobstore itself will know its own store_created date, and would have modified (but currently not created) in the custom metadata as mtime.

How are these going to get used? Things (rsync, hopefully the web service) should be presenting the file with the mtime from the custom metadata. They won't know without asking the datatracker whether that object had some history (we are not currently planning to use versioned blobstores). But that's ok as the datatracker is the authoritative store for the history of the objects.

At the datatracker, I don't see any immediate use for store_created other than debugging and monitoring (making sure the store behaves as we believe it will). created is a piece of information we otherwise don't necessarily have in the datatracker. For many objects there'll be a corresponding DocEvent, but not for all, so it's an opportunity to keep track of when things started to exist that we would otherwise lose. modified will likely be directly useful for maintaining datasets that can be updated (like the html presentations), and for a while it will likely be useful for stitching algorithms.
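The 2020/2025/2028 scenario above, spelled out as a small self-check (the dates follow this comment; the plain-dict record is purely illustrative):

```python
from datetime import datetime, timezone

def jan1(year):
    return datetime(year, 1, 1, tzinfo=timezone.utc)

# Bulk import in 2025 of a draft html presentation whose mtime is from 2020:
record = {
    "store_created": jan1(2025),  # when the store record appeared
    "created": jan1(2020),        # backfilled from the source file's mtime
    "modified": jan1(2020),       # same value: the best we can do at import time
}

# Content update in 2028: only modified moves.
record["modified"] = jan1(2028)

assert record["store_created"] == jan1(2025)
assert record["created"] == jan1(2020)
assert record["modified"] == jan1(2028)
```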

Member Author

And when not backfilling, I don't see any use cases where we would explicitly provide an mtime.

Member Author

(And I think we want to maintain the ability to import a new puddle of objects in the future that came with curated past dates).

Member

Ok, that all makes sense. My worry is that we're introducing possibly surprising behavior in the handling of mtime: some future user (us or otherwise) might try to update the modified time for a bunch of files by setting mtime and be surprised that it also overwrote created. (I.e., are we setting ourselves up for a future head-slap moment when we re-discover the analog of what we were just reminded about with the ctime business?)

However, the pattern would be to introduce (yet another) MetadataFile-like class. Given that we don't know when we'd use it, I'll be happy enough if we just document the assumption here about how we're using mtime. The code here is boilerplate-y, so I think it'd be worth a docstring for the save method.
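One possible shape for that docstring (the wording is a suggestion, assuming the save method matches the snippet under review):

```python
def save(self, name, content):
    """Save content under name, updating the StoredObject timestamps.

    If content carries an explicit mtime (e.g. a MetadataFile built by the
    bulk import), BOTH created and modified are set to that mtime. That is
    intentional: an explicit mtime is only expected when backfilling
    pre-existing artifacts, where the source file's mtime is the best value
    we have for both fields. Ordinary saves must not pass an explicit
    mtime, or they will silently rewrite created as well as modified.
    """
```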

Member

One other thing we should be careful about: the BlobdbStorage class returns files as BlobFile instances, which are subclasses of MetadataFile and have mtime set. If we were, say, to refactor the submit process to keep submitted files directly in a BlobdbStorage, we might end up writing post_submission() code that in sketch looks like

submitted_file = submit_store.open(...)  # has mtime set
draft_store.save(submitted_file, ...)  # will trigger the explicit mtime behavior

This leaves me thinking we might want to go to the trouble of adding special fields to trigger the explicit timestamp overwrites.
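One way the special-field idea could look (everything here is hypothetical: the flag name, the class, and the helper are illustrations, not proposed API):

```python
# Hypothetical: require an explicit opt-in flag rather than inferring intent
# from the presence of mtime alone, so a file returned by another store
# (which always has mtime set) can't silently trigger the override.
class MetadataFileSketch:
    def __init__(self, content, mtime=None, override_timestamps=False):
        self.content = content
        self.mtime = mtime  # informational by default
        self.override_timestamps = override_timestamps  # explicit opt-in

def wants_explicit_mtime(f):
    # only honor mtime when the caller asked for the override
    return f.mtime is not None and f.override_timestamps

# e.g. a file read back from submit_store.open(): mtime set, no opt-in
submitted = MetadataFileSketch(b"draft", mtime=1234567890.0)
# a bulk-import file: opt-in set, so the override still fires
backfill = MetadataFileSketch(b"old", mtime=946684800.0, override_timestamps=True)

assert not wants_explicit_mtime(submitted)  # re-save won't rewrite created
assert wants_explicit_mtime(backfill)       # bulk import still can
```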

Member

While digging around, I see that the Storage interface has a number of other methods (get_accessed_time(), get_created_time(), get_modified_time(), etc.) we haven't looked at. It might be worth exploring those instead of inventing our own wheel (on the other hand, they might bring baggage we don't want).
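For reference, Django's base Storage does define get_accessed_time(name), get_created_time(name), and get_modified_time(name), each returning a datetime and raising NotImplementedError when the backend can't answer. A dependency-free sketch of how a blob-backed storage might serve get_modified_time from the custom-metadata mtime (the class and its internals are illustrative, not the real BlobdbStorage):

```python
from datetime import datetime, timezone

class BlobdbStorageSketch:
    """Illustrative subset of Django's Storage timestamp API."""

    def __init__(self):
        self._custom_metadata = {}  # name -> {"mtime": float, ...}

    def get_modified_time(self, name):
        # serve the standard Storage hook from the blob's custom metadata
        meta = self._custom_metadata.get(name)
        if meta is None or "mtime" not in meta:
            raise NotImplementedError("no modified time recorded for %r" % name)
        return datetime.fromtimestamp(meta["mtime"], tz=timezone.utc)

store = BlobdbStorageSketch()
store._custom_metadata["draft.html"] = {"mtime": 1577836800.0}  # 2020-01-01 UTC
assert store.get_modified_time("draft.html").year == 2020
```

Reusing the standard hooks would let callers that already speak the Storage API read timestamps without a parallel mechanism, at the cost of inheriting whatever semantics Django attaches to them.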

@jennifer-richards jennifer-richards changed the base branch from feat/blobstage to main April 7, 2025 20:42
@jennifer-richards jennifer-richards changed the base branch from main to feat/blobstage April 7, 2025 20:42
@rjsparks rjsparks marked this pull request as draft July 25, 2025 09:22