feat: delete orphaned files #1958

Open · wants to merge 16 commits into main

Conversation

jayceslesar (Contributor) commented:

Closes #1200

Rationale for this change

Adds the ability to do more table maintenance from pyiceberg.

Are these changes tested?

Added a test!

Are there any user-facing changes?

Yes, this is a new method on the Table class.

@Fokko (Contributor) left a comment:

Thanks for working on this @jayceslesar, sorry for the late review.

I think this is a great start, I left some comments, let me know what you think!

@@ -1371,6 +1375,45 @@ def to_polars(self) -> pl.LazyFrame:

return pl.scan_iceberg(self)

def delete_orphaned_files(self) -> None:
Contributor:

I think it would be good to add some options that we also have on the Java side, at a minimum:

  • older_than: Remove orphan files created before this timestamp (defaults to 3 days). It can be that some process is writing to the table and has files staged to be added to the metadata tree. If we don't take this into account, these files might be removed in the window between writing and committing.
  • dry_run: When true, don't actually remove files (defaults to false). I think it would also be nice to return the set of files removed (see the sketch below):

Suggested change:
- def delete_orphaned_files(self) -> None:
+ def delete_orphaned_files(self) -> Set[str]:
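
Fleshed out, that could look something like the following. This is a minimal sketch, not code from this PR: the _orphaned_files helper is a hypothetical name for the candidate-set computation, and the Table.location() / FileIO.delete wiring is an assumption about how it might be hooked up.

from datetime import timedelta
from typing import Optional, Set


def delete_orphaned_files(
    self,
    older_than: Optional[timedelta] = timedelta(days=3),
    dry_run: bool = False,
) -> Set[str]:
    """Remove files in the table location that no metadata references.

    Files newer than `older_than` are skipped, so staged but not yet
    committed writes are not deleted. With dry_run=True nothing is
    removed and the candidate set is returned for inspection.
    """
    # _orphaned_files is a hypothetical helper listing unreferenced files
    orphans = self._orphaned_files(self.location(), older_than=older_than)
    if not dry_run:
        for file in orphans:
            self.io.delete(file)  # assumed FileIO-based deletion
    return orphans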

Contributor (author):

Added!

Contributor (author):

Is there a reason that older_than is not a table property?

Contributor:

This is also a table property: https://iceberg.apache.org/docs/nightly/configuration/#table-behavior-properties

Would be great to add history.expire.max-snapshot-age-ms in this PR. We have the TableProperties class where we can add this, along with its default value. We can add history.expire.min-snapshots-to-keep in a subsequent PR.
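
For reference, a sketch of what that could look like in TableProperties; the constant names are illustrative, while the 5-day default matches the documented Java default (432000000 ms):

class TableProperties:
    # ... existing properties elided ...
    MAX_SNAPSHOT_AGE_MS = "history.expire.max-snapshot-age-ms"
    MAX_SNAPSHOT_AGE_MS_DEFAULT = 5 * 24 * 60 * 60 * 1000  # 5 days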

Contributor (author):

Are you implying that we should treat snapshots older than that property value as orphans? Or just to add that property in the TableProperties class?

@smaheshwar-pltr (Contributor) left a comment:

Thanks for the PR @jayceslesar, using InspectTable to get orphaned files to submit to the executor pool is a nice idea! Just some concerns / suggestions / debugging help 😄


from pyiceberg.io.pyarrow import _fs_from_file_path

all_known_files = set()
Contributor:

We also want to have manifest list files here (I don't see them now). Otherwise, they'll be removed by the procedure and the table will be "corrupted".

(Related: when looking at Java tests, I noticed apache/iceberg#12957)

Contributor:

The same goes for the current metadata JSON file, and I think to match Java behaviour we want to include all files in the metadata log of the current metadata file too.

I think there are more files we might be missing - tests would be nice to make sure nothing is overlooked! (Perhaps inspiration can be taken from the Java ones.)

Contributor (author):

I see! I just pushed a change that will capture those, as well as the statistics file paths.
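
Roughly, gathering every referenced path could look like this. This is a sketch assuming pyiceberg's metadata model (Snapshot.manifest_list, TableMetadata.metadata_log / .statistics, and the file_path / path columns of the inspect tables); the exact wiring may differ from what the PR ends up with:

def _all_known_files(tbl) -> set:
    known = set()
    # data and delete files, from the files metadata table
    known.update(tbl.inspect.files()["file_path"].to_pylist())
    # manifest files across all snapshots
    known.update(tbl.inspect.all_manifests()["path"].to_pylist())
    # manifest lists: one per snapshot, never safe to delete
    known.update(s.manifest_list for s in tbl.metadata.snapshots)
    # the current metadata file plus everything in its metadata log
    known.add(tbl.metadata_location)
    known.update(e.metadata_file for e in tbl.metadata.metadata_log)
    # statistics files referenced by the table
    known.update(st.statistics_path for st in tbl.metadata.statistics)
    return known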

as_of = datetime.now(timezone.utc) - older_than if older_than else None
all_files = [f for f in fs.get_file_info(selector) if f.type == FileType.File and (as_of is None or (f.mtime < as_of))]

orphaned_files = set(all_files).difference(all_known_files)
Contributor:

I think we need to be careful here: all_files is a list of FileInfo objects, but all_known_files is a set of strs. So the set difference won't do anything, because a FileInfo object will never be in a set of strings.

Contributor (author):

ah good catch - this happened in a little refactor, I just need to use f.path
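
That is, comparing path strings on both sides. A short sketch of the fix, reusing fs, selector, as_of, and all_known_files from the snippet above:

from pyarrow.fs import FileType  # as in the original snippet

all_file_paths = {
    f.path  # FileInfo.path is the string path
    for f in fs.get_file_info(selector)
    if f.type == FileType.File and (as_of is None or f.mtime < as_of)
}
orphaned_files = all_file_paths - all_known_files  # now str vs. str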

Contributor (author):

fixed


from pyiceberg.io.pyarrow import _fs_from_file_path

all_known_files = set()
Contributor:

Part of me wonders whether we could expose this as a method: a public, documented inspect utility that returns all files referenced by a table. Curious what others think about whether this would be useful, I'm not fully convinced myself. (We could also then restructure orphaned file detection to use that)

Contributor (author):

I think it would likely make things simpler; inspect could use a little beefing up IMO. I came across #1626, which is a good start.

@jayceslesar (author) commented on May 3, 2025:

Yeah, I am going to play around with this. It makes testing a lot easier

@jayceslesar (author) commented on May 3, 2025:

Okay, let me know what you think about the change I just pushed -- see all_known_files. @Fokko, please take a look as well. This should make testing a lot easier (if I have both of your blessings here I will add tests for this function) and allow us to make smarter changes going forward.

@kevinjqliu (Contributor) left a comment:

Thanks for the PR! I added a few comments. ptal :)

deletes = executor.map(_delete, orphaned_files)
# exhaust
list(deletes)
logger.info(f"Deleted {len(orphaned_files)} orphaned files at {location}!")
Contributor:

nit: this might not necessarily always be true, especially when _delete errors are suppressed.

What if we counted the number of successful deletes here? Maybe _delete can return True/False for whether the delete succeeded.
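
One way to do that, sketched against the names in the snippet above (fs as the pyarrow filesystem; the error handling shown is an assumption, not code from the PR):

def _delete(file_path: str) -> bool:
    """Return True if the file was deleted, False if the delete failed."""
    try:
        fs.delete_file(file_path)  # pyarrow.fs.FileSystem.delete_file
        return True
    except OSError as e:
        logger.warning(f"Failed to delete {file_path}: {e}")
        return False


deleted = sum(executor.map(_delete, orphaned_files))  # True counts as 1
logger.info(f"Deleted {deleted} of {len(orphaned_files)} orphaned files at {location}!")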

Contributor:

the spark procedure outputs orphan_file_location, which lists all the files set to be deleted. this is pretty useful for logging:
https://iceberg.apache.org/docs/nightly/spark-procedures/#output_7

Contributor (author):

Just modified, let me know!


def orphaned_files(self, location: str, older_than: Optional[timedelta] = timedelta(days=3)) -> Set[str]:
"""Get all the orphaned files in the table.

Contributor:

nit: add a sentence explaining what orphaned files are, maybe copy/paste from https://iceberg.apache.org/docs/nightly/spark-procedures/#remove_orphan_files
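
For instance, the docstring could read something like this, with the wording adapted from the Spark remove_orphan_files docs:

def orphaned_files(self, location: str, older_than: Optional[timedelta] = timedelta(days=3)) -> Set[str]:
    """Get all the orphaned files in the table.

    Orphaned files are files under the given location that are not
    referenced in any metadata file of the table and can thus be
    considered "orphaned".
    """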

Contributor (author):

done

@@ -1371,6 +1376,28 @@ def to_polars(self) -> pl.LazyFrame:

return pl.scan_iceberg(self)

def delete_orphaned_files(self, older_than: Optional[timedelta] = timedelta(days=3), dry_run: bool = False) -> None:
Contributor:

nit: we should always provide an older_than arg. this protects the orphan file deletion job from deleting recently created files that are still waiting to be committed.

Contributor (author):

done


return _all_known_files

def orphaned_files(self, location: str, older_than: Optional[timedelta] = timedelta(days=3)) -> Set[str]:
Contributor:

nit: should we expose this as a public function, given that there's no equivalent on the java/spark side? we modeled the inspect tables on java's metadata tables.
maybe we can change this to _orphaned_files for now

Contributor (author):

Does this still need to be addressed now that this is under a new namespace?

_, _, path = _parse_location(location)
selector = FileSelector(path, recursive=True)
# filter to just files as it may return directories, and filter on time
as_of = datetime.now(timezone.utc) - older_than if older_than else None
Contributor:

older_than should always be present, see the above comment

Contributor (author):

done

@kevinjqliu (Contributor) commented:

a meta question: what do you think of moving the orphan file function to its own file/namespace, similar to how we use .inspect?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

@jayceslesar (author) commented on May 4, 2025:

> a meta question: what do you think of moving the orphan file function to its own file/namespace, similar to how we use .inspect?
>
> i like the idea of having all the table maintenance functions together, similar to delta table's optimize

I think that makes sense -- would #1880 end up there too?

Also ideally there is a CLI that exposes all the maintenance actions too right?

I think moving things to a new OptimizeTable class in a new optimize.py namespace makes a lot of sense; it can be modeled very similarly to InspectTable and generally makes things cleaner. I still think it makes sense to keep all_known_files inside inspect, though, and the new OptimizeTable can use it from there.
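
A rough sketch of that shape, mirroring how InspectTable hangs off Table.inspect; every name here is illustrative rather than settled API, and the two helpers called inside the method are hypothetical:

# in a new pyiceberg/table/optimize.py (hypothetical module)
from datetime import timedelta
from typing import Optional, Set


class OptimizeTable:
    """Namespace for table maintenance actions, analogous to InspectTable."""

    def __init__(self, tbl) -> None:
        self.tbl = tbl  # the Table this namespace wraps

    def delete_orphaned_files(
        self, older_than: Optional[timedelta] = timedelta(days=3), dry_run: bool = False
    ) -> Set[str]:
        # all_known_files stays on inspect, as discussed (hypothetical helper)
        known = self.tbl.inspect.all_known_files()
        # _files_on_disk is a hypothetical helper listing files under the table location
        orphans = self.tbl._files_on_disk(older_than) - known
        if not dry_run:
            for file in orphans:
                self.tbl.io.delete(file)
        return orphans


# exposed on Table the same way as `inspect`:
# @property
# def optimize(self) -> OptimizeTable:
#     return OptimizeTable(self)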
