Range bytes=-N header format not compatible with Azure Blob Storage #25893

Open

eklesel opened this issue May 30, 2025 · 1 comment

eklesel commented May 30, 2025

We proxy our Azure Blob Storage through flexify.io, having migrated away from the MinIO gateway; this gives us an S3-compatible interface to Azure Blob Storage.

When Trino reads a Parquet file, it looks for the PAR1 magic number in the file footer. To fetch the footer, it makes a GET request with a suffix byte range to the storage endpoint (some headers have been removed):

GET /bucket/01971b9f-7b05-73a2-8ebd-af8c31907960/01971ba0-58d8-7da0-ad29-f089ae4994dc/00000-0-b1313c67-9270-422f-9e37-620e1d9803ee.parquet HTTP/1.1
...
Range: bytes=-49152
...

i.e. the last 49152 bytes are requested.

Azure returns the following:

HTTP/1.1 200 OK
...
Content-Length: 10050405
x-ms-version: 2023-11-03
Accept-Ranges: bytes
x-ms-blob-type: BlockBlob
...

i.e. it returns all 10050405 bytes, the entire file, with a 200 status instead of a 206 Partial Content.
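
For comparison, a server that honors suffix ranges (RFC 7233) would answer the same request roughly as follows; the Content-Range values are derived from the Content-Length above (10050405 - 49152 = 10001253):

HTTP/1.1 206 Partial Content
...
Content-Length: 49152
Content-Range: bytes 10001253-10050404/10050405
Accept-Ranges: bytes
...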

I'm not sure exactly how Trino then searches the response for the magic bytes, but it results in the following error:

Query 20250530_101233_00002_msjz3 failed: Malformed Parquet file. Expected magic number: PAR1 got: u%?+

Given that it requested the last N bytes, I assume Trino checks a specific byte offset within the returned buffer for the magic number and can't find it there; otherwise it would find the magic bytes at the end of the file as normal.
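
To illustrate why the check fails: per the Parquet format, a file ends with a 4-byte footer length followed by the 4-byte magic PAR1, so a reader expects the magic at the very end of the tail buffer. Below is a minimal sketch of that kind of footer check, not Trino's actual code. If the server ignored the suffix range and returned the start of the file, the bytes at that offset are arbitrary file content, which matches the garbled characters in the error above.

// Minimal sketch of a Parquet footer magic check, not Trino's actual code.
// 'tail' is assumed to hold the last bytes of the file, e.g. from a
// Range: bytes=-49152 read.
static void checkFooterMagic(byte[] tail) {
    String magic = new String(tail, tail.length - 4, 4, java.nio.charset.StandardCharsets.ISO_8859_1);
    if (!magic.equals("PAR1")) {
        throw new IllegalStateException("Malformed Parquet file. Expected magic number: PAR1 got: " + magic);
    }
}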

For whatever reason, this bytes=-N suffix-range format is not supported by Azure. The same request works against AWS S3 storage directly.

Is there anything that can be done to get Trino to request an explicit range? Trino makes a HEAD request just before the GET, so the Content-Length is already known when the GET is issued; it should therefore be possible to request bytes=(length-N)-(length-1) instead of bytes=-N to get the last N bytes.
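
A sketch of that conversion, assuming the length from the HEAD response is available (the helper name here is hypothetical, not a Trino internal):

// Convert a suffix range for the last n bytes into an explicit range,
// using the object length learned from the preceding HEAD request.
// HTTP byte ranges are zero-based and inclusive, hence (length - 1).
static String explicitTailRange(long length, long n) {
    long start = Math.max(0, length - n);
    return "bytes=" + start + "-" + (length - 1);
}

// explicitTailRange(10050405, 49152) returns "bytes=10001253-10050404"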

For now, we're resorting to the legacy hive.s3 filesystem (doc), which uses the header Range: bytes=0-9223372036854775806 and works. However, that filesystem is deprecated and will be removed in a future release, so we don't want to rely on it.

@raunaqmorarka (Member) commented:

The Azure client is doing what Azure supports in io.trino.filesystem.azure.AzureInput#readTail, and the S3 client is doing what S3 supports in io.trino.filesystem.s3.S3Input#readTail.
It's possible to change the S3 client to use whatever is the lowest common denominator between S3 and Azure, but that feels brittle, as we wouldn't be able to test any future changes against this scenario: our S3 tests run only against S3-compatible implementations.
I think it is justified for the S3 client to expect an S3-compatible interface and not have to change its behaviour to adjust for a backend that doesn't support some aspect of the S3 interface. I would expect the proxy layer that you're using to adapt to the mismatch in interfaces.
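
For context, a suffix-range read through the AWS SDK v2 (which the native S3 file system is built on) looks roughly like the sketch below. This is an illustration of the request shape, not a verbatim excerpt of S3Input#readTail; the bucket and key are placeholders.

import java.io.IOException;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

// Sketch: ask S3 for the last 49152 bytes of an object via a suffix range.
// S3 answers 206 with just the tail; Azure behind a proxy that forwards the
// header unchanged returns the whole object with 200, as described above.
static byte[] readTail(String bucket, String key) throws IOException {
    try (S3Client s3 = S3Client.create()) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket)               // placeholder
                .key(key)                     // placeholder
                .range("bytes=-49152")        // suffix range: last 49152 bytes
                .build();
        return s3.getObject(request).readAllBytes();
    }
}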
