Incomplete HTTP reads #1748

Open
amotl opened this issue Nov 7, 2024 · 4 comments

@amotl
Contributor

amotl commented Nov 7, 2024

Hi there,

thanks a stack for conceiving and maintaining fsspec. We are successfully using it in a few projects and have never observed any kind of issue so far. 💯

Right now, we may have discovered an edge case that leads to truncated HTTP response bodies. A simple reproducer is attached below. This report is very similar to that other one,

... with the main difference being that we occasionally receive complete responses when downgrading to fsspec 0.9.0.

Does this make any sense?

With kind regards,
Andreas.

Code Snippet

import fsspec

fs = fsspec.open("https://github.com/pyveci/pueblo/raw/refs/heads/sfa/tests/testdata/entrypoint.py")
with fs as f:
    print(f.read())

Actual Response

b'def main():\n    print("Hallo, R\xc3\xa4uber Hotzenplotz.")  # noqa: T201\n    retur'

Expected Response

b'def main():\n    print("Hallo, R\xc3\xa4uber Hotzenplotz.")  # noqa: T201\n    return 42\n'
@martindurant
Member

A possible cause of this is that the server reports the file size of the compressed file, but interprets Range requests against the uncompressed data.

I see

'Content-Length': '95'
'Content-Encoding': 'gzip'

but the file is being assigned a size of 81 bytes; this is the number of bytes in the real, uncompressed output.

The actual length of the data returned is 76 bytes.

So we have questions:

  • why is the file size assumed to be 81, when HEAD reports 97? I can only assume that, for this data, gzip compression makes it bigger.
  • why does Range=0-80 return 76 bytes? I can only assume that 81 bytes of gzip data arrived, and the end-of-stream error was ignored.
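
Both assumptions can be checked offline with the standard library. The following is only a sketch of the suspected mechanism, not a trace of what the server or fsspec actually does: it assumes the compressed stream is cut at the uncompressed size, and the exact byte counts depend on the gzip implementation, so they need not match the 97 and 76 observed here.

```python
import gzip
import zlib

# The 81-byte body from the "Expected Response" above.
payload = (
    b'def main():\n'
    b'    print("Hallo, R\xc3\xa4uber Hotzenplotz.")  # noqa: T201\n'
    b'    return 42\n'
)

# gzip adds a fixed header and trailer, so tiny inputs usually grow.
compressed = gzip.compress(payload)

# Assumption: the compressed stream is cut at the *uncompressed* size,
# which is what a Range of 0-80 would do in the suspected scenario.
truncated = compressed[: len(payload)]

# A streaming decompressor returns whatever it could decode and does
# not raise when the input simply stops early.
decoder = zlib.decompressobj(wbits=31)  # 31 selects the gzip container
partial = decoder.decompress(truncated)

# The one-shot API, by contrast, reports the missing end-of-stream marker.
try:
    gzip.decompress(truncated)
    one_shot_error = None
except EOFError as exc:
    one_shot_error = exc

print(len(payload), len(compressed), len(partial), decoder.eof)
```

If `decoder.eof` is still false after the last chunk, the stream was incomplete; a client that never checks this would hand back a silently shortened body.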

Possible workarounds:

  • use cache_type="all"
  • cache the file by prepending "simplecache::" to the URL

Possible fixes:

  • fix fsspec.implementations.http._file_info to give the size expected by subsequent range requests
  • recognise that read() from position 0 doesn't need a range request at all.
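
To illustrate the first idea, here is a minimal sketch of how a size lookup could decline to trust Content-Length whenever a content encoding is in play. The helper name `usable_size` is hypothetical and is not part of fsspec; returning None is assumed to mean "size unknown, read the whole stream instead of issuing range requests".

```python
from typing import Mapping, Optional


def usable_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return a size that is safe for Range arithmetic, else None.

    Hypothetical helper for illustration; not fsspec's implementation.
    """
    # Header names are case-insensitive, so normalise them first.
    normalized = {key.lower(): value for key, value in headers.items()}

    encoding = normalized.get("content-encoding", "identity").strip().lower()
    if encoding not in ("", "identity"):
        # Content-Length counts encoded bytes here, so it may not match
        # the offsets the server applies to Range requests.
        return None

    length = normalized.get("content-length", "").strip()
    return int(length) if length.isdigit() else None


# The headers quoted above yield "unknown" rather than a misleading size.
print(usable_size({"Content-Length": "95", "Content-Encoding": "gzip"}))
```

The trade-off is that such a file can no longer be read in random-access blocks, which is acceptable for small files like this one but costly for large ones.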

@amotl
Contributor Author

amotl commented Nov 8, 2024

Dear Martin,

thanks a stack for your swift reply. I will try one of the workarounds you suggested.

Do you think it also makes sense to report that observation, including your assessment, to GitHub SREs, in order to give them a chance to check their HTTP server configurations? I figure from your response something might be wrong over there, and also @d70-t reported at GH-614 that he observed such an issue with GitHub the other day.

With kind regards,
Andreas.

@martindurant
Member

You can maybe try reporting it. I'm not certain what the HTTP standards say the behaviour ought to be in this situation.

@amotl
Contributor Author

amotl commented Nov 8, 2024

Possible workarounds:

  • use cache_type="all"
  • cache the file by prepending "simplecache::" to the URL

The second workaround works well, thank you very much.

fsspec.open(f"simplecache::{url}")

This incantation, however, does not change the behaviour; the outcome is still a truncated response body.

fsspec.open(url, cache_type="all")

You can maybe try reporting it.

Will do, referring to this ticket and your excellent evaluation. Thanks. I will report back relevant details when applicable.
