Adding queryParamsHashId into tablePath for Snapshot/CDF/Streaming queries #679

Open
wants to merge 14 commits into
base: main

Conversation

littlegrasscao
Collaborator

@littlegrasscao littlegrasscao commented Apr 9, 2025

This change updates the keying strategy of the existing pre-signed URL cache (which stores fileId -> fileUrl mappings) to account for different query shapes on the same table. Previously, the cache key only included the table path, so a new query with different parameters could overwrite existing cache entries. This update ensures correctness by incorporating the full query context into the cache key.

Query Parameters by Query Type:

Snapshot Queries:

  • Filters
  • Limit
  • Table version

Change Data Feed (CDF) Queries:

  • A Map[String, String] of CDF options, including:
    • Starting version
    • Ending version

Filters are excluded from the key, as files are fetched for the entire version range and filtered in memory. CDF queries also do not support limits.

Streaming Queries:

Streaming parquet files

  • Starting version
  • Ending version (if applicable)

Streaming CDF files

  • Same as CDF queries

Similar to CDF, filters and limits are excluded from the key, since filtering is applied in memory after fetching all relevant files.

This update prevents cache collisions between distinct query shapes on the same table and ensures accurate file-to-URL resolution for all supported query types.
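
The keying scheme described above can be illustrated with a minimal sketch. It is not the implementation in this PR: `QueryKeySketch`, its method names, the SHA-256 digest, and the `_` separator are all assumptions made for illustration. It only shows how a hash of the query shape can be appended to the table path so that two different queries on the same table no longer share a cache key.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

// Hypothetical helper, not the delta-sharing QueryUtils implementation.
object QueryKeySketch {
  // Hex-encoded SHA-256 digest of the input string.
  private def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes(UTF_8))
      .map(b => f"$b%02x")
      .mkString

  // Snapshot queries: filters, limit, and table version all shape the file list.
  def snapshotHashId(filters: Seq[String], limit: Option[Long], version: Long): String = {
    val limitPart = limit.map(_.toString).getOrElse("")
    sha256Hex(s"snapshot|${filters.mkString(";")}|$limitPart|$version")
  }

  // CDF queries: only the CDF options (for example starting/ending version) are hashed.
  // Options are sorted so that map iteration order cannot change the hash.
  def cdfHashId(cdfOptions: Map[String, String]): String =
    sha256Hex("cdf|" + cdfOptions.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString(";"))

  // Cache key: table path plus the hash of the query shape.
  def cacheKey(tablePath: String, queryParamsHashId: String): String =
    s"${tablePath}_$queryParamsHashId"
}
```

With a key built this way, two snapshot queries that differ only in their limit resolve to different cache entries, while two CDF queries with the same options resolve to the same entry regardless of the order in which the options were supplied.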

@littlegrasscao littlegrasscao changed the title from "Adding queryParamsHashId into tablePath" to "Adding queryParamsHashId into tablePath for Snapshot/CDF/Streaming queries" Apr 21, 2025
@littlegrasscao littlegrasscao marked this pull request as ready for review April 21, 2025 22:14
@@ -764,11 +764,21 @@ case class DeltaSharingSource(
// version.
val filteredActions = fileActions.filter{ indexedFile => indexedFile.getFileAction != null }

// For streaming queries, build the query parameters hash using the start/end version.
// All the files between start and end version are cached for a tablePath.
val queryParamsHashId = QueryUtils.getQueryParamsHashId(startVersion, endOffset)
Collaborator

Shall we use the DS request param to generate the hash id instead?

(startVersion + endOffset) is not enough. The request will differ based on the value of isStartingVersion and options.readChangeFeed as well.

Maybe we could use a global variable similar to latestRefreshFunc to keep track of the hash id.

Collaborator Author

endOffset already includes isStartingVersion, but readChangeFeed is missing.

This is a good suggestion. I have added a global variable to capture the query parameters. getFiles and getCDFfiles require different inputs, so we should be safe this way.
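
The approach discussed in this thread, deriving the hash id from the parameters of the request actually sent and keeping it in a variable next to the source, might look roughly like the sketch below. Everything in it is an assumption made for illustration: `StreamingRequestParams`, `StreamingSourceSketch`, their fields, and the SHA-256 digest are hypothetical and are not the types used in DeltaSharingSource.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

// Hypothetical model of the parameters that shape a streaming request;
// the field names are assumptions, not the delta-sharing client's types.
final case class StreamingRequestParams(
    startingVersion: Long,
    endingVersion: Option[Long],
    readChangeFeed: Boolean) {

  // The case-class toString lists every field, so changing any field changes the id.
  def hashId: String =
    MessageDigest.getInstance("SHA-256")
      .digest(toString.getBytes(UTF_8))
      .map(b => f"$b%02x")
      .mkString
}

// Hypothetical holder, playing the role the thread gives to a variable
// kept alongside latestRefreshFunc.
class StreamingSourceSketch {
  @volatile private var latestParams: Option[StreamingRequestParams] = None

  // Call this wherever the request is actually issued.
  def recordRequest(params: StreamingRequestParams): Unit =
    latestParams = Some(params)

  // Read this when registering the fetched file URLs in the cache.
  def currentHashId: Option[String] = latestParams.map(_.hashId)
}
```

Because readChangeFeed is part of the hashed parameters, a parquet request and a CDF request over the same version range produce different ids, which is the gap the reviewer pointed out.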

params.snapshotAtAnalysis.filesForScan(Nil, None, None, this)
.map(f => toDeltaSharingPath(f).toString)
val queryParamsHashId = QueryUtils.getQueryParamsHashId(
version = params.snapshotAtAnalysis.version
Collaborator

Though the chance is very low, the latest table version could still change between this params.snapshotAtAnalysis.version call (which triggers GetTableMetadata) and the params.snapshotAtAnalysis.filesForScan call below (which triggers QueryTable).

We could use the version returned from QueryTable call to generate the hash id instead.

Collaborator Author

This is a good point. I changed the logic to build the hash id inside filesForScan and return it to this call site as a string.
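
A minimal sketch of the fix described in this thread, assuming hypothetical types (`QueryTableResponse`, `ScanResult`, `ScanSketch`) and a plain string standing in for the real hash: the id is derived from the version carried by the QueryTable response itself, so the file list and the id always come from the same request and cannot disagree about the version.

```scala
// Hypothetical stand-ins for the server response and the scan result.
final case class QueryTableResponse(version: Long, fileIds: Seq[String])
final case class ScanResult(fileIds: Seq[String], queryParamsHashId: String)

object ScanSketch {
  // queryTable represents the single QueryTable call; no separate metadata call
  // is consulted for the version, which removes the window for a race.
  def filesForScan(queryTable: () => QueryTableResponse, limit: Option[Long]): ScanResult = {
    val response = queryTable()
    // A readable string stands in for the real hash in this sketch.
    val hashId = s"v${response.version}-limit${limit.map(_.toString).getOrElse("none")}"
    ScanResult(response.fileIds, hashId)
  }
}
```

The caller then uses the returned id when building the cache key instead of computing one from a separately fetched version.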

Comment on lines 209 to 210
partitionFilters.map(_.sql).mkString(";"),
dataFilters.map(_.sql).mkString(";"),
Collaborator

nit: Similar to above, I think it's better to use the real request params to generate the hash id. partitionFilters and dataFilters might be redundant since jsonPredicateHints is included.

Collaborator Author

I removed partitionFilters and dataFilters, and added a new predicates parameter, which also seems to be generated from partitionFilters and dataFilters.


3 participants