Incomplete HTTP reads #1748

amotl · 2024-11-07T22:09:52Z

Hi there,

thanks a stack for conceiving and maintaining fsspec. We are successfully using it in a few projects and have never observed any kind of issue so far. 💯

Right now, we may have discovered an edge case that leads to truncated HTTP response bodies. A simple reproducer is attached below. This report is very similar to that other one,

Incomplete HTTP reads with 0.9.0 #614

... with the main difference being that we occasionally receive complete responses when downgrading to fsspec 0.9.0.

Does this make any sense?

With kind regards,
Andreas.

Code Snippet

import fsspec

fs = fsspec.open("https://github.com/pyveci/pueblo/raw/refs/heads/sfa/tests/testdata/entrypoint.py")
with fs as f:
    print(f.read())

Actual Response

b'def main():\n    print("Hallo, R\xc3\xa4uber Hotzenplotz.")  # noqa: T201\n    retur'

Expected Response

b'def main():\n    print("Hallo, R\xc3\xa4uber Hotzenplotz.")  # noqa: T201\n    return 42\n'

martindurant · 2024-11-08T03:02:29Z

A possible cause of this is that the server reports the file size of the compressed file, but interprets Range requests against the uncompressed data.

I see

'Content-Length': '95'
'Content-Encoding': 'gzip'

but the file is being assigned a size of 81 bytes, which is the number of bytes in the real (uncompressed) output.

The actual length of the data returned is 76 bytes.

So we have questions:

Why is the file size assumed to be 81 when HEAD reports 97? I can only assume that, for this data, gzip compression makes it bigger.
Why does Range=0-80 return 76 bytes? I can only assume that 81 bytes of gzip data arrived and the end-of-stream error was ignored.
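
To make the numbers above easier to check, here is a minimal local sketch. It does not contact GitHub. It only rebuilds the expected body from the report, confirms the 81 and 76 byte counts, and imitates the second guess by feeding a cut-off gzip stream to a decoder that does not complain. The exact compressed size depends on the gzip implementation, so it will not necessarily match what the server reports, and the length of the partial output here is not claimed to be 76.

```python
import gzip
import zlib

# Expected body, copied from the "Expected Response" in the report.
payload = (
    b'def main():\n'
    b'    print("Hallo, R\xc3\xa4uber Hotzenplotz.")  # noqa: T201\n'
    b'    return 42\n'
)
# Truncated body, as shown under "Actual Response".
observed = payload[:76]

print(len(payload))   # 81
print(len(observed))  # 76

# gzip adds a header and a trailer, so an input this small grows.
compressed = gzip.compress(payload)
print(len(compressed) > len(payload))

# Assumption for illustration: only the first len(payload) bytes of the
# gzip stream arrive, and the decoder returns what it has decoded so far
# instead of raising on the missing end of stream.
decoder = zlib.decompressobj(wbits=31)  # wbits=31 selects gzip framing
partial = decoder.decompress(compressed[: len(payload)])
print(decoder.eof, len(partial))
```

If the partial output is a strict prefix of the payload and `decoder.eof` is false, that is at least consistent with the idea that a size taken from one representation was applied to the other.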

Possible workarounds:

use cache_type="all"
cache the file by prepending "simplecache::" to the URL

Possible fixes:

fix fsspec.implementations.http._file_info to give the size expected by subsequent range requests
recognise that read() from position 0 doesn't need a range request at all.
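
Both ideas can be sketched in isolation. The helpers below are hypothetical and are not fsspec's actual code: `trusted_size` and `range_header` are names made up for this illustration, and the header mapping is assumed to use exactly the key spellings shown.

```python
from typing import Dict, Mapping, Optional


def trusted_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length only when it describes the decoded body.

    Hypothetical helper: if the response is content-encoded, the length
    counts compressed bytes, so it is not used as the file size.
    """
    encoding = headers.get("Content-Encoding", "identity").strip().lower()
    if encoding not in ("", "identity"):
        return None
    length = headers.get("Content-Length")
    return int(length) if length is not None else None


def range_header(start: int, end: Optional[int]) -> Dict[str, str]:
    """Build request headers for reading bytes [start, end).

    Hypothetical helper: a read of the whole file from position 0 gets
    no Range header, so a plain GET is issued instead.
    """
    if start == 0 and end is None:
        return {}
    last = "" if end is None else str(end - 1)  # HTTP ranges are inclusive
    return {"Range": f"bytes={start}-{last}"}


print(trusted_size({"Content-Length": "95", "Content-Encoding": "gzip"}))  # None
print(range_header(0, None))  # {}
print(range_header(0, 81))    # {'Range': 'bytes=0-80'}
```

With the headers quoted above, `trusted_size` declines to report a size at all, and a full read from the start never sends the `Range=0-80` request that came back short.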

amotl · 2024-11-08T20:22:16Z

Dear Martin,

thanks a stack for your swift reply. I will try one of the workarounds you suggested.

Do you think it also makes sense to report that observation, including your assessment, to GitHub SREs, in order to give them a chance to check their HTTP server configurations? I figure from your response something might be wrong over there, and also @d70-t reported at GH-614 that he observed such an issue with GitHub the other day.

With kind regards,
Andreas.

martindurant · 2024-11-08T21:27:55Z

You can maybe try reporting it. I'm not certain what the HTTP standards say the behaviour ought to be in this situation.

amotl · 2024-11-08T23:09:50Z

Possible workarounds:

use cache_type="all"

cache the file by prepending "simplecache::" to the URL

The second workaround works well, thank you very much.

fsspec.open(f"simplecache::{url}")

The following incantation, however, does not change the behaviour; the outcome is still a truncated response body.

fsspec.open(url, cache_type="all")

You can maybe try reporting it.

Will do, referring to this ticket and your excellent evaluation. Thanks. I will report back relevant details when applicable.
