Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.0 Cherry Pick] [#3423] Fix unnecessary DynamoDB GET calls during LogStore::listFrom VACUUM calls #3461

Merged

Conversation

scottsand-db
Copy link
Collaborator

Cherry-pick 03bdf84 to branch 3.0

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #3423.

This PR updates the logic in BaseExternalLogStore::listFrom so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non _delta_log file is being listed.

This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the _delta_log. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the _delta_log so clearly now isn't the time to attempt to do a fixup, and (2) expensive.

This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB).

How was this patch tested?

Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky).

Does this PR introduce any user-facing changes?

No

…istFrom VACUUM calls (delta-io#3425)

#### Which Delta project/connector is this regarding?

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Resolves delta-io#3423.

This PR updates the logic in `BaseExternalLogStore::listFrom` so that it
does not make a request to get the latest entry from the external store
(which is used to perform recovery operations) in the event that a non
`_delta_log` file is being listed.

This is useful for VACUUM operations which may do hundreds or thousands
of list calls in the table directory and nested partition directories of
parquet files. This is NOT the `_delta_log`. Thus, checking the external
store during these list calls is (1) useless and unwanted as we are not
listing the `_delta_log` so clearly now isn't the time to attempt to do
a fixup, and (2) expensive.

This PR makes it so that future VACUUM operations do not perform
unnecessary calls to the external store (e.g. DyanamoDB).

## How was this patch tested?

Unit tests and an integration test that actually runs VACUUM and
compares the number of external store calls using the old/new logic. I
ran that test myself 50 times, too, and it passed every time (therefore,
not flaky).

## Does this PR introduce _any_ user-facing changes?

No
@scottsand-db
Copy link
Collaborator Author

Failures are unrelated to my PR. Merging.

@scottsand-db scottsand-db merged commit 0183834 into delta-io:branch-3.0 Aug 5, 2024
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants