[Bug]: HTTPFileSystem not working due to general incompatibility of Path and PurePosixPath in SimpleDirectoryReader with fsspec #19157

Open
@lsg551

Bug Description

According to the docs, SimpleDirectoryReader supports all fsspec file systems. However, fsspec's HTTPFileSystem seems incompatible with it. SimpleDirectoryReader is built around Path and PurePosixPath objects, but neither of these handles URLs correctly, and HTTPFileSystem relies on URLs.

For instance, this simplified example will fail with a FileNotFoundError: http:/localhost:8080/foo.txt:

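from fsspec.implementations.http import HTTPFileSystem
from llama_index.core import SimpleDirectoryReader

fs = HTTPFileSystem()
reader = SimpleDirectoryReader(
    input_files=["http://localhost:8080/foo.txt"],
    fs=fs,
)
docs = reader.load_data()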

As you can see, the scheme delimiter in the error message has lost one of its two slashes (http:/ instead of http://). This results in an invalid URL, which is why the file cannot be found.

The issue is right at the beginning of the __init__ method of SimpleDirectoryReader, where the input_files are processed:

_Path = Path if is_default_fs(self.fs) else PurePosixPath
if input_files:
    self.input_files = []
    for path in input_files:
        if not self.fs.isfile(path):
            raise ValueError(f"File {path} does not exist.")
        input_file = _Path(path)
        self.input_files.append(input_file)

The code attempts to convert the input files into Path or PurePosixPath objects, which do not handle URLs correctly, leading to the malformed URL in the error message.

>>> from pathlib import Path, PurePosixPath
>>> Path("http://localhost:8080/foo.txt")
PosixPath('http:/localhost:8080/foo.txt')
>>> PurePosixPath("http://localhost:8080/foo.txt")
PurePosixPath('http:/localhost:8080/foo.txt')
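To make the consequence more concrete, here is a small stdlib-only sketch (no llama_index or fsspec involved) that shows the round trip through PurePosixPath is lossy and that the mangled string no longer carries a host when parsed as a URL. The variable names are only illustrative.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = "http://localhost:8080/foo.txt"

# pathlib treats the string as a filesystem path and drops empty
# segments, so the "//" after the scheme collapses to a single "/".
mangled = str(PurePosixPath(url))

# The round trip through PurePosixPath does not give the URL back.
round_trip_ok = mangled == url

# Without "//", the URL parser no longer sees a host component.
original_host = urlsplit(url).netloc
mangled_host = urlsplit(mangled).netloc

print(mangled)
print(round_trip_ok, repr(original_host), repr(mangled_host))
```

So by the time the stored path is handed back to the file system, the host information is effectively gone, which would explain the FileNotFoundError.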

As far as I can tell, another user already reported a similar issue in #16602 (with a pending PR #16612), but with an S3-based fs, and #15406 describes the same problem. However, these issues focus on file-specific readers, such as PDFReader.
In this comment, a user already pointed out that Path and PurePosixPath are not suitable for fsspec.
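One possible direction, purely as a rough sketch: keep URL-like inputs as plain strings for non-default file systems instead of wrapping them in a path class. The helper name `to_fs_path` and its `default_fs` flag are hypothetical and not part of llama_index. I have not checked how much downstream code relies on attributes like `.suffix` or `.name`, so this is an assumption-laden illustration rather than a proposed patch.

```python
from pathlib import Path, PurePosixPath
from typing import Union

PathLike = Union[str, Path, PurePosixPath]


def to_fs_path(path: str, default_fs: bool) -> PathLike:
    """Hypothetical helper (not llama_index API): pick a safe path type."""
    if default_fs:
        # Local file system: pathlib.Path works as before.
        return Path(path)
    if "://" in path:
        # URL-like input (http://, s3://, ...): leave the string untouched
        # so the "//" after the scheme survives.
        return path
    # Plain remote path without a scheme: PurePosixPath is still fine.
    return PurePosixPath(path)


print(to_fs_path("http://localhost:8080/foo.txt", default_fs=False))
print(to_fs_path("data/foo.txt", default_fs=False))
```

With something like this, the URL from the example above would stay intact, while local paths and scheme-less remote paths would behave as they do today.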

I spent some time trying to figure out what was going on and first checked that everything else was working.
I recommend mentioning this in the docs to save some users time should the issue persist longer.

Version

0.12.43

Steps to Reproduce

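from fsspec.implementations.http import HTTPFileSystem
from llama_index.core import SimpleDirectoryReader

fs = HTTPFileSystem()
reader = SimpleDirectoryReader(
    input_files=["http://localhost:8080/foo.txt"],
    fs=fs,
)
docs = reader.load_data()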

Relevant Logs/Tracebacks

`FileNotFoundError: http:/localhost:8080/foo.txt`


Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)
