Adding queryParamsHashId into tablePath for Snapshot/CDF/Streaming queries #679

littlegrasscao · 2025-04-09T05:45:21Z

This change updates the keying strategy of the existing pre-signed URL cache (which stores fileId -> fileUrl mappings) to account for different query shapes on the same table. Previously, the cache key only included the table path, so a new query with different parameters could overwrite existing cache entries. This update ensures correctness by incorporating the full query context into the cache key.
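The collision described above can be reproduced with a toy cache. This is a minimal sketch, not the code in `PreSignedUrlCache.scala`: the `UrlCache` class, the `cacheKey` helper, and the `tablePath_hashId` key layout are all assumptions made for illustration.

```scala
import scala.collection.mutable

object UrlCacheKeySketch {
  // Hypothetical stand-in for the pre-signed URL cache:
  // cache key -> (fileId -> fileUrl).
  final class UrlCache {
    private val entries = mutable.Map.empty[String, Map[String, String]]

    // Registering under an existing key replaces the previous mapping.
    def register(key: String, idToUrl: Map[String, String]): Unit = {
      entries.update(key, idToUrl)
    }

    def lookup(key: String, fileId: String): Option[String] =
      entries.get(key).flatMap(_.get(fileId))
  }

  // Assumed key layout: table path plus a hash id for the query shape.
  def cacheKey(tablePath: String, queryParamsHashId: String): String =
    s"${tablePath}_${queryParamsHashId}"
}
```

With the table path alone as the key, a second `register` call for the same table drops the first query's file ids. With `cacheKey`, two query shapes on the same table occupy separate entries and both stay resolvable.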

Query Parameters by Query Type:

Snapshot Queries:

- Filters
- Limit
- Table version

Change Data Feed (CDF) Queries:

A Map[String, String] of CDF options, including:
- Starting version
- Ending version

Filters are excluded from the key, as files are fetched for the entire version range and filtered in memory. CDF queries also do not support limits.
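One way to turn these parameters into a stable hash id is sketched below. The helper names `snapshotHashId` and `cdfHashId`, the field separators, and the choice of SHA-256 are assumptions for illustration; the PR's `QueryUtils.getQueryParamsHashId` may be implemented differently.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

object QueryParamsHashSketch {
  // Hex-encoded SHA-256 of the canonical parameter string.
  private def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes(UTF_8))
      .map("%02x".format(_))
      .mkString

  // Snapshot queries: filters, limit, and table version all shape the result.
  def snapshotHashId(
      filters: Seq[String],
      limit: Option[Long],
      version: Option[Long]): String =
    sha256Hex(Seq(
      "snapshot",
      filters.mkString(";"),
      limit.map(_.toString).getOrElse(""),
      version.map(_.toString).getOrElse("")
    ).mkString("|"))

  // CDF queries: only the options map is hashed. Entries are sorted so that
  // the id does not depend on the map's iteration order.
  def cdfHashId(cdfOptions: Map[String, String]): String =
    sha256Hex(
      "cdf|" + cdfOptions.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString(";"))
}
```

Prefixing each canonical string with the query type keeps a snapshot query and a CDF query from sharing an id even if their remaining fields happen to serialize identically.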

Streaming Queries:

Streaming parquet files:

- Starting version
- Ending version (if applicable)

Streaming CDF files:

- Same as CDF queries

Similar to CDF, filters and limits are excluded from the key, since filtering is applied in memory after fetching all relevant files.

This update prevents cache collisions between distinct query shapes on the same table and ensures accurate file-to-URL resolution for all supported query types.

client/src/main/scala/io/delta/sharing/spark/DeltaSharingSource.scala

client/src/main/scala/io/delta/sharing/spark/RemoteDeltaFileIndex.scala

client/src/main/scala/org/apache/spark/delta/sharing/PreSignedUrlCache.scala

spark/src/test/scala/io/delta/sharing/spark/DeltaSharingSourceSuite.scala

This reverts commit 85dfb47.

charlenelyu-db · 2025-04-25T09:55:57Z

client/src/main/scala/io/delta/sharing/spark/DeltaSharingSource.scala

@@ -764,11 +764,21 @@ case class DeltaSharingSource(
    // version.
    val filteredActions = fileActions.filter{ indexedFile => indexedFile.getFileAction != null }

+    // For streaming queries, build the query parameters hash using the start/end version.
+    // All the files between start and end version are cached for a tablePath.
+    val queryParamsHashId = QueryUtils.getQueryParamsHashId(startVersion, endOffset)


Shall we use the DS request param to generate the hash id instead?

(startVersion + endOffset) is not enough. The request will differ based on the value of isStartingVersion and options.readChangeFeed as well.

Maybe we could use a global variable similar to latestRefreshFunc to keep track of the hash id.

endOffset already includes isStartingVersion, but readChangeFeed is missing.

This is a good suggestion. I have added a global variable to capture the query parameters. getFiles and getCDFfiles require different inputs, so we should be safe this way.
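The gap raised in this thread can be shown with a simplified model of the streaming request. The `StreamingRequest` case class and both key functions are hypothetical and only illustrate why the read mode has to be part of the key.

```scala
object StreamingKeySketch {
  // Hypothetical, simplified view of what a streaming request depends on.
  final case class StreamingRequest(
      startVersion: Long,
      endVersion: Option[Long],
      readChangeFeed: Boolean)

  // Key built from the version range only, as in the first revision.
  def keyWithoutMode(r: StreamingRequest): String =
    s"${r.startVersion}:${r.endVersion.map(_.toString).getOrElse("")}"

  // Key that also records whether parquet files or CDF files are requested.
  def keyWithMode(r: StreamingRequest): String = {
    val mode = if (r.readChangeFeed) "cdf" else "files"
    s"$mode:${keyWithoutMode(r)}"
  }
}
```

Two requests over the same version range, one with `readChangeFeed` set and one without, get the same `keyWithoutMode` but different `keyWithMode` values.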

charlenelyu-db · 2025-04-25T10:36:40Z

client/src/main/scala/io/delta/sharing/spark/RemoteDeltaFileIndex.scala

-    params.snapshotAtAnalysis.filesForScan(Nil, None, None, this)
-      .map(f => toDeltaSharingPath(f).toString)
+    val queryParamsHashId = QueryUtils.getQueryParamsHashId(
+      version = params.snapshotAtAnalysis.version


Though the chance is very low, the latest table version could still change between params.snapshotAtAnalysis.version (which triggers GetTableMetadata) and the params.snapshotAtAnalysis.filesForScan call below (which triggers QueryTable).

We could use the version returned from QueryTable call to generate the hash id instead.

This is a good point. I changed the logic to build the hash id inside filesForScan and return it as a string to this call site.
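A rough sketch of that approach follows, assuming a simplified `QueryResponse` that carries the version the server actually served. The types, the `filesForScan` signature, and the unhashed key string are illustrative only and do not mirror the real method.

```scala
object SnapshotVersionSketch {
  // Hypothetical response shape: the version comes back with the files.
  final case class QueryResponse(version: Long, fileIds: Seq[String])
  final case class ScanResult(fileIds: Seq[String], queryParamsKey: String)

  // The key is derived from the version in the query response, so a table
  // update between an earlier metadata call and this call cannot produce a
  // key for the wrong version. The key is left unhashed for readability.
  def filesForScan(queryTable: () => QueryResponse, limit: Option[Long]): ScanResult = {
    val response = queryTable()
    val key = s"version=${response.version}|limit=${limit.map(_.toString).getOrElse("")}"
    ScanResult(response.fileIds, key)
  }
}
```

If a metadata call reported version 6 but the query was answered at version 7, the returned key reflects version 7, which matches the files that were actually fetched.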

charlenelyu-db · 2025-04-25T10:52:10Z

client/src/main/scala/io/delta/sharing/spark/RemoteDeltaFileIndex.scala

+      partitionFilters.map(_.sql).mkString(";"),
+      dataFilters.map(_.sql).mkString(";"),


nit: Similar to above, I think it's better to use the real request params to generate the hash id. partitionFilters and dataFilters might be redundant since jsonPredicateHints is included.

I removed partitionFilters and dataFilters and added predicates instead, which also appear to be generated from partitionFilters and dataFilters.

littlegrasscao changed the title ~~Adding queryParamsHashId into tablePath~~ Adding queryParamsHashId into tablePath for Snapshot/CDF/Streaming queries Apr 21, 2025

littlegrasscao marked this pull request as ready for review April 21, 2025 22:14

littlegrasscao requested review from linzhou-db and charlenelyu-db April 21, 2025 22:15

linzhou-db reviewed Apr 22, 2025

littlegrasscao mentioned this pull request Apr 23, 2025

Modify pre-signed url cache to store all refreshers for identical queries #699

Closed

Sun Cao added 12 commits April 23, 2025 16:57

Adding queryParamsHashId into tablePath

87ca5f3

adding comments

b9440db

adding config for the change

d32d916

Adding tests for snapshot and cdf queries

1444cca

add query params into snapshot inputFiles

9034c69

add integration test

6eee4fb

remove streaming query

fe63fe8

Revert "remove streaming query"

624e4b2

This reverts commit 85dfb47.

clean up

66e1fbf

clean up code

6788841

update suite name

6986478

remove startIndex from streaming hash

ebb9bc3

littlegrasscao force-pushed the sun-cao/queryparams branch from 740b089 to ebb9bc3 Compare April 24, 2025 00:14

charlenelyu-db reviewed Apr 25, 2025

Sun Cao added 2 commits April 25, 2025 12:42

Address Charlene's comment

5f256fa

update format

599f5f6

littlegrasscao requested a review from charlenelyu-db April 25, 2025 21:45
