Support Directory Based Access #782

@zhu-tom

This is a proposal to support directory based access in the Delta Sharing Protocol.

Motivation

The Delta Sharing protocol currently grants temporary access to a table's files via its QueryTable API. This operation can be expensive, as the server needs to unpack the Delta log, discover the Parquet data files needed for the query, and generate pre-signed URLs for them. A petabyte-scale table can have millions of data files, which puts significant load on the server and can degrade query performance.

Similar in spirit to the UC OSS GenerateTemporaryTableCredentials API, we would like to support sharing tables with Cloud Tokens, which are directory (prefix) based STS tokens that grant temporary read access to the table’s root directory. This approach bypasses the pre-signing workflow and instead provides direct read-only access to the table. Query engines that are capable of processing the Delta log get direct access to it, and can optimize query performance with their own metadata optimizations, caching, and distributed metadata processing.

We propose adding directory based access to the Delta Sharing protocol to further enrich the open sharing ecosystem.

Protocol Changes

The table object returned by the Delta Sharing APIs will now contain metadata regarding the table root storage location and supported access modes, along with any auxiliary locations.

List all Tables in a Share

HTTP Request

Method

GET

URL

{prefix}/shares/{share}/all-tables

HTTP Response

200: The tables were successfully returned.

Header

Content-Type: application/json; charset=utf-8

Body

{
  "items": [
    {
      "name": "string",
      "schema": "string",
      "share": "string",
      "shareId": "string",
      "id": "string",
      "location": "{scheme}://some/path/to/table",
      "auxiliaryLocations": [
        "{scheme}://some/path/1",
        "{scheme}://some/path/2"
      ],
      "accessModes": ["url","dir"]
    }
  ],
  "nextPageToken": "string"
}

Note: location should point to the root directory of the table.

Note: auxiliaryLocations is an optional field which represents any auxiliary storage locations for the table. These should be supported in the auxiliaryLocation field of the Generate Temporary Table Credential request body.

Note: accessModes represents the supported access modes for the table. This can be url, dir, or both. If url is present, the QueryTable endpoint should be implemented for the table. If dir is present, the GenerateTemporaryTableCredential endpoint should be implemented for the table.
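
As an illustration of how a recipient might act on accessModes, here is a minimal sketch. The class `AccessModeChooser` and its `choose` method are hypothetical names, not part of the protocol or any library, and the preference for dir whenever the client can process the Delta log itself is an assumption about client policy.

```java
import java.util.List;

/** Hypothetical client-side helper; not part of the Delta Sharing protocol. */
final class AccessModeChooser {
    enum AccessMode { URL, DIR }

    /**
     * Picks an access mode for one table.
     *
     * @param serverModes           the table's accessModes as returned by the server, e.g. ["url", "dir"]
     * @param clientCanReadDeltaLog whether this client can process the Delta log on its own
     */
    static AccessMode choose(List<String> serverModes, boolean clientCanReadDeltaLog) {
        boolean dirOffered = serverModes.contains("dir");
        boolean urlOffered = serverModes.contains("url");
        // Assumed policy: prefer directory access when both sides support it.
        if (dirOffered && clientCanReadDeltaLog) {
            return AccessMode.DIR; // use GenerateTemporaryTableCredential
        }
        if (urlOffered) {
            return AccessMode.URL; // fall back to QueryTable and pre-signed URLs
        }
        throw new IllegalStateException("No usable access mode in " + serverModes);
    }
}
```

A client that cannot read the Delta log and is offered only dir gets an exception here, which a real client would likely surface as an "unsupported table" error.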

List Tables in a Schema

HTTP Request

Method

GET

URL

{prefix}/shares/{share}/schemas/{schema}/tables

HTTP Response

200: The tables were successfully returned.

Header

Content-Type: application/json; charset=utf-8

Body

{
  "items": [
    {
      "name": "string",
      "schema": "string",
      "share": "string",
      "shareId": "string",
      "id": "string",
      "location": "{scheme}://some/path/to/table",
      "auxiliaryLocations": [
        "{scheme}://some/path/1",
        "{scheme}://some/path/2"
      ],
      "accessModes": ["url","dir"]
    }
  ],
  "nextPageToken": "string"
}

Note: location should point to the root directory of the table.

Note: auxiliaryLocations is an optional field which represents any auxiliary storage locations for the table. These should be supported in the auxiliaryLocation field of the Generate Temporary Table Credential request body.

Note: accessModes represents the supported access modes for the table. This can be url, dir, or both. If url is present, the QueryTable endpoint should be implemented for the table. If dir is present, the GenerateTemporaryTableCredential endpoint should be implemented for the table.

Query Table Metadata

When the client requests QueryTableMetadata for a table with accessModes containing dir, the server must support directory based access for the table and QueryTableMetadata must return the location of the table for directory based access. In the case that the client does not support directory based access, this field is optional. However, we recommend that this field be included to support recipients with network restrictions to allow these locations to be accessed.

auxiliaryLocations is an optional field which represents any auxiliary storage locations for the table. These should be supported in the auxiliaryLocation field of the Generate Temporary Table Credential request body.

accessModes represents the supported access modes for the table. This can be url, dir, or both. If url is present, the QueryTable endpoint should be implemented for the table. If dir is present, the GenerateTemporaryTableCredential endpoint should be implemented for the table.

Parquet

{
  "protocol": {
    "minReaderVersion": 1
  }
}
{
  "metaData": {
    "id": "f8d5c169-3d01-4ca3-ad9e-7dc3355aedb2",
    "location": "{scheme}://some/path/to/table",
    "auxiliaryLocations": [
       "{scheme}://some/path/1",
       "{scheme}://some/path/2"
    ],
    "accessModes": ["url","dir"],
    "format": {
      "provider": "parquet"
    },
    "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"eventTime\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}",
    "partitionColumns": [
      "date"
    ]
  }
}

Delta

{
  "metaData": {
    "version": 20,
    "size": 123456,
    "numFiles": 5,
    "location": "{scheme}://some/path/to/table",
    "auxiliaryLocations": [
       "{scheme}://some/path/1",
       "{scheme}://some/path/2"
    ],
    "accessModes": ["url","dir"],
    "deltaMetadata": {
      "partitionColumns": [
        "date"
      ],
      "format": {
        "provider": "parquet"
      },
      "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"eventTime\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},{\"name\":\"date\",\"type\":\"date\",\"nullable\":true,\"metadata\":{}}]}",
      "id": "f8d5c169-3d01-4ca3-ad9e-7dc3355aedb2",
      "configuration": {
        "enableChangeDataFeed": "true"
      }
    }
  }
}
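
For recipients with network restrictions, the location and auxiliaryLocations fields can be turned into a list of storage endpoints to allow. The following is a rough sketch under that reading. `StorageAllowlist` and `authorities` are hypothetical names, and the bucket names in the usage note are made up.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that derives the distinct storage endpoints a recipient must reach. */
final class StorageAllowlist {
    /**
     * Returns the distinct "scheme://authority" prefixes of the table's main
     * location and any auxiliary locations, in first-seen order.
     */
    static Set<String> authorities(String location, List<String> auxiliaryLocations) {
        List<String> all = new ArrayList<>();
        all.add(location);
        if (auxiliaryLocations != null) { // auxiliaryLocations is optional
            all.addAll(auxiliaryLocations);
        }
        Set<String> result = new LinkedHashSet<>();
        for (String loc : all) {
            URI uri = URI.create(loc);
            // For s3://bucket/path the authority is the bucket name.
            result.add(uri.getScheme() + "://" + uri.getAuthority());
        }
        return result;
    }
}
```

For example, `StorageAllowlist.authorities("s3://bucket-a/t", List.of("s3://bucket-a/aux", "s3://bucket-b/x"))` yields the two entries `s3://bucket-a` and `s3://bucket-b`.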

Generate Temporary Table Credential

Given that the directory and URL access code paths are distinct, their respective endpoints should remain separate rather than being combined. The response follows the format of GenerateTemporaryTableCredential in UC OSS. The location field is also added to introduce a potentially lightweight approach which avoids the metadata call and pre-processing the delta log. It should be the location which the credentials are generated for. Clients that do not support reading from a cloud vendor can throw an error.

HTTP Request Value
Method

POST

Headers

Authorization: Bearer {token}

Optional: Content-Type: application/json; charset=utf-8

Optional: delta-sharing-capabilities: responseformat=delta;readerfeatures=deletionvectors;accessModes=url,dir

URL

{prefix}/shares/{share}/schemas/{schema}/tables/{table}/temporary-table-credentials

URL Parameters

{share}: The share name to query. It's case-insensitive.

{schema}: The schema name to query. It's case-insensitive.

{table}: The table name to query. It's case-insensitive.

Request Body

The location field is optional and specifies the location URL path to generate temporary credentials for. If this field is not provided, the response should contain credentials for the table's main location. If the main location is specified explicitly, the server should still respond with the credential.

{
  "location": "{scheme}://some/path/to/table"
}
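
A client-side request for this endpoint could be assembled roughly as follows, using the JDK's `java.net.http` API (Java 11+). This is a sketch only: `TemporaryCredentialRequests` is a hypothetical name, share, schema, and table names are assumed to be URL-safe, and sending `{}` when no location is given is an assumption, since the proposal only says the field is optional.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical request builder for the temporary-table-credentials endpoint. */
final class TemporaryCredentialRequests {

    /** Builds the JSON request body; a null location yields an empty object (assumption). */
    static String requestBody(String location) {
        if (location == null) {
            return "{}";
        }
        // Minimal JSON string escaping for backslashes and double quotes.
        String escaped = location.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"location\":\"" + escaped + "\"}";
    }

    /** Builds the POST request; names are assumed URL-safe and are not percent-encoded here. */
    static HttpRequest build(String prefix, String share, String schema, String table,
                             String bearerToken, String location) {
        URI uri = URI.create(prefix + "/shares/" + share + "/schemas/" + schema
                + "/tables/" + table + "/temporary-table-credentials");
        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Bearer " + bearerToken)
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(location)))
                .build();
    }
}
```

The resulting request can be sent with `java.net.http.HttpClient.send`. Passing one of the table's auxiliaryLocations as `location` would request credentials for that location instead of the table root.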

Response Body

Only one of awsTempCredentials, azureUserDelegationSas, gcpOauthToken should be defined.

{
  "credentials": {
    "location": "{scheme}://some/path/to/table",
    "awsTempCredentials": {
      "accessKeyId": "string",
      "secretAccessKey": "string",
      "sessionToken": "string"
    },
    "azureUserDelegationSas": {
      "sasToken": "string"
    },
    "gcpOauthToken": {
      "oauthToken": "string"
    },
    "expirationTime": 123456789
  }
}

TemporaryCredentials

Only one of awsTempCredentials, azureUserDelegationSas, gcpOauthToken should be defined. Their definitions follow Unity Catalog OSS models and APIs.

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| location | string | The directory which the temporary credentials are granted read access to. | [required] |
| awsTempCredentials | AwsCredentials | | [optional] |
| azureUserDelegationSas | AzureUserDelegationSAS | | [optional] |
| gcpOauthToken | GcpOauthToken | | [optional] |
| expirationTime | Long | Server time when the credential will expire, in epoch milliseconds. The API client is advised to cache the credential given this expiration time. | [required] |

AwsCredentials

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| accessKeyId | String | The access key ID that identifies the temporary credentials. | [required] |
| secretAccessKey | String | The secret access key that can be used to sign AWS API requests. | [required] |
| sessionToken | String | The token that users must pass to AWS API to use the temporary credentials. | [required] |

AzureUserDelegationSAS

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| sasToken | String | Azure SAS Token | [required] |

GcpOauthToken

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| oauthToken | String | Gcp Token | [required] |
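
Two rules from the model above lend themselves to simple client-side checks: exactly one provider credential should be defined, and a cached credential should be refreshed before expirationTime. The sketch below shows one way to express both. `TemporaryCredentialChecks` and its methods are hypothetical names, and the safety margin is an assumption rather than something the proposal specifies.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical validation helpers for a TemporaryCredentials response. */
final class TemporaryCredentialChecks {

    /** True when exactly one of the three provider-specific credential objects is present. */
    static boolean exactlyOneDefined(Object awsTempCredentials,
                                     Object azureUserDelegationSas,
                                     Object gcpOauthToken) {
        int defined = (awsTempCredentials != null ? 1 : 0)
                + (azureUserDelegationSas != null ? 1 : 0)
                + (gcpOauthToken != null ? 1 : 0);
        return defined == 1;
    }

    /**
     * True when a cached credential should be replaced: it has expired, or it
     * will expire within safetyMargin of now (the margin is an assumed client policy).
     */
    static boolean needsRefresh(long expirationTimeEpochMillis, Instant now, Duration safetyMargin) {
        Instant expiry = Instant.ofEpochMilli(expirationTimeEpochMillis);
        return !now.plus(safetyMargin).isBefore(expiry);
    }
}
```

A client would typically call `needsRefresh` with `Instant.now()` before each read and request a new credential when it returns true.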

Delta Kernel Example

To load a table shared with directory access using Delta Kernel, follow these steps from the Delta Kernel documentation. The Hadoop configuration needs to be modified to contain the credentials used to authenticate with the cloud provider.

// Explicit imports: the wildcard imports do not cover the engine subpackages.
import io.delta.kernel.Table;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.defaults.engine.DefaultEngine;
import org.apache.hadoop.conf.Configuration;

// Table root, as returned in the "location" field.
String myTablePath = "s3://some/path/to/table";
Configuration hadoopConf = new Configuration();
// Use the temporary (session) credentials from awsTempCredentials.
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY_ID");
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY");
hadoopConf.set("fs.s3a.session.token", "YOUR_SESSION_TOKEN");
Engine myEngine = DefaultEngine.create(hadoopConf);
Table myTable = Table.forPath(myEngine, myTablePath);
...
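
To connect the credential response to the Hadoop settings used above, the AwsCredentials fields can be mapped onto the same S3A property keys. This is a minimal sketch that uses only the JDK so it runs without Hadoop on the classpath. `S3aProperties` is a hypothetical helper name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical mapper from AwsCredentials fields to the S3A properties shown above. */
final class S3aProperties {
    static Map<String, String> fromAwsTempCredentials(String accessKeyId,
                                                      String secretAccessKey,
                                                      String sessionToken) {
        Map<String, String> props = new LinkedHashMap<>();
        // Same keys and provider class as in the Delta Kernel example.
        props.put("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
        props.put("fs.s3a.access.key", accessKeyId);
        props.put("fs.s3a.secret.key", secretAccessKey);
        props.put("fs.s3a.session.token", sessionToken);
        return props;
    }
}
```

With Hadoop available, the map can be applied via `S3aProperties.fromAwsTempCredentials(id, secret, token).forEach(hadoopConf::set)` before creating the engine.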
