Description
Bug Description
According to the docs, `SimpleDirectoryReader` supports all `fsspec` file systems. But `fsspec`'s `HTTPFileSystem` seems incompatible with `SimpleDirectoryReader`. The reader is centered around `Path` and `PurePosixPath` objects, neither of which handles URLs correctly, and `HTTPFileSystem` relies on URLs.
For instance, this simplified example fails with `FileNotFoundError: http:/localhost:8080/foo.txt`:
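from fsspec.implementations.http import HTTPFileSystem
from llama_index.core import SimpleDirectoryReader

fs = HTTPFileSystem()
reader = SimpleDirectoryReader(
    input_files=["http://localhost:8080/foo.txt"],
    fs=fs,
)
docs = reader.load_data()  # raises FileNotFoundError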
As you can see, the scheme delimiter in the error message has only a single slash (`http:/` instead of `http://`). This results in an invalid URL, which is why the file cannot be found.
The issue is right at the beginning of the `__init__` method of `SimpleDirectoryReader`, where the `input_files` are processed:
llama_index/llama-index-core/llama_index/core/readers/file/base.py
Lines 284 to 292 in 0b77334
The code attempts to convert the input files into `Path` or `PurePosixPath` objects, which do not handle URLs correctly, leading to the malformed URL in the error message.
>>> from pathlib import Path, PurePosixPath
>>> Path("http://localhost:8080/foo.txt")
PosixPath('http:/localhost:8080/foo.txt')
>>> PurePosixPath("http://localhost:8080/foo.txt")
PurePosixPath('http:/localhost:8080/foo.txt')
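To see why the slash disappears, it may help to look at how `pathlib` splits the string. The sketch below uses only the standard library, and its variable names are illustrative rather than taken from `SimpleDirectoryReader`. `pathlib` treats `http:` as an ordinary path component and discards the empty component between the two slashes, so converting back to a string cannot restore the original URL.

```python
from pathlib import PurePosixPath

url = "http://localhost:8080/foo.txt"
path = PurePosixPath(url)

# pathlib has no notion of a URL scheme: "http:" is just the first component,
# and the empty component between the two slashes is dropped.
parts = path.parts         # ('http:', 'localhost:8080', 'foo.txt')
round_tripped = str(path)  # 'http:/localhost:8080/foo.txt'

print(parts)
print(round_tripped)
print(round_tripped == url)  # False: the original URL is not recoverable
```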
As far as I can tell, another user already reported a similar issue in #16602 (with a pending PR #16612), but with an S3-based fs, and #15406 describes the same problem. However, these issues focus on file-specific readers, such as `PDFReader`.
In this comment, a user already pointed out that `Path` and `PurePosixPath` are not suitable for `fsspec`.
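One possible direction, sketched here purely as an illustration, would be to leave URL-like inputs as plain strings and only wrap local paths in a path object. The helper name `to_fs_path` and the `"://"` check are assumptions made for this example; they are not part of LlamaIndex or `fsspec`, and a real fix might well look different.

```python
from pathlib import PurePosixPath
from typing import Union


def to_fs_path(raw: str) -> Union[str, PurePosixPath]:
    """Hypothetical helper: keep URL-like inputs untouched, wrap plain paths."""
    # Assumption: anything containing "://" is a URL that the file system
    # (e.g. HTTPFileSystem) should receive verbatim.
    if "://" in raw:
        return raw
    return PurePosixPath(raw)


print(to_fs_path("http://localhost:8080/foo.txt"))  # URL is preserved as-is
print(to_fs_path("data/foo.txt"))                   # local path is wrapped
```

With a check like this, the string passed on to the file system would still contain the full `http://` prefix, while local paths would keep their current behavior.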
I spent some time trying to figure out what was going on, first checking that everything else was working.
I recommend mentioning this in the docs to save other users time should the issue persist longer.
Version
0.12.43
Steps to Reproduce
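from fsspec.implementations.http import HTTPFileSystem
from llama_index.core import SimpleDirectoryReader

fs = HTTPFileSystem()
reader = SimpleDirectoryReader(
    input_files=["http://localhost:8080/foo.txt"],
    fs=fs,
)
docs = reader.load_data()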
Relevant Logs/Tracebacks
`FileNotFoundError: http:/localhost:8080/foo.txt`