@li-yi-dong

For an HDFS URL that includes a hostname, such as hdfs://hostname/user/xxx, the function resolve_pattern drops the hostname and outputs hdfs:///user/xxx. This can break later file operations, because they may try to connect to the wrong HDFS cluster.

@li-yi-dong changed the title from "add HDFS hostname to protocol prefix" to "Bug fix: Add HDFS hostname to protocol prefix" on Jan 9, 2026
@lhoestq
Member

lhoestq commented Jan 9, 2026

Hi! Is this related to #7934?

It's not clear to me why the protocol would need this, given that the hostname should already be present in the pattern:

resolve_pattern("hdfs://hostname/user/xxx", ...)

@li-yi-dong
Author

It's related to #7934 in a subtle way. In my use case, I need to specify the HDFS hostname. In theory, I can do it with

from datasets import load_dataset

# Option 1: put the hostname directly in the URL
ds = load_dataset(
    "parquet",
    data_files={
        "train": "hdfs://hostname/xxx*.parquet",
    },
    streaming=True,
)

or

# Option 2: leave the hostname out of the URL and pass it via storage_options
ds = load_dataset(
    "parquet",
    data_files={
        "train": "hdfs:///xxx*.parquet",
    },
    streaming=True,
    storage_options={
        "host": "hostname"
    }
)

Neither of them works.
The first one fails because of the bug this PR is trying to fix, and the second one fails because of #7934.

Yes, resolve_pattern is called like resolve_pattern("hdfs://hostname/user/xxx", ...), but its output looks like hdfs:///user/xxx, with no hostname in it. This output is then passed to later file operations such as fsspec.open(), which need the hostname in the URL to find the correct HDFS cluster.
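
To illustrate the point, here is a minimal sketch of how the hostname can get lost. It does not use the real resolve_pattern or a real HDFS cluster: the helpers rebuild_without_host and rebuild_with_host are hypothetical names, and the list of globbed paths is made up to stand in for what a glob call might return. It only assumes that glob results are absolute paths with no hostname.

```python
from urllib.parse import urlsplit


def rebuild_without_host(pattern, globbed_paths):
    """Hypothetical: prefix each path with only '<scheme>://'."""
    scheme = urlsplit(pattern).scheme
    prefix = f"{scheme}://"
    return [prefix + path for path in globbed_paths]


def rebuild_with_host(pattern, globbed_paths):
    """Hypothetical: keep the hostname (netloc) from the original pattern."""
    parts = urlsplit(pattern)
    prefix = f"{parts.scheme}://{parts.netloc}"
    return [prefix + path for path in globbed_paths]


# Assumed stand-in for glob results: absolute paths, no hostname.
pattern = "hdfs://hostname/user/xxx/*.parquet"
globbed = ["/user/xxx/part-0.parquet", "/user/xxx/part-1.parquet"]

print(rebuild_without_host(pattern, globbed)[0])
# hdfs:///user/xxx/part-0.parquet
print(rebuild_with_host(pattern, globbed)[0])
# hdfs://hostname/user/xxx/part-0.parquet
```

When the pattern has no hostname (hdfs:///user/xxx), both helpers produce the same result, so keeping the netloc should not change behavior for URLs that never had one.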

@li-yi-dong
Author

@lhoestq
Hi! Is there any concern?🙃

@lhoestq
Member

lhoestq commented Jan 14, 2026

I see. I think the path forward is to fix #7934, which sounds like an actual xPath bug. resolve_pattern dropping the hostname comes from the fsspec HDFS implementation, which we should probably try to follow.

@li-yi-dong
Author

Fixing #7934 alone can solve my problem.

But I don't think fsspec intends to drop the hostname. resolve_pattern is supposed to convert a pattern into absolute file paths while keeping the protocol untouched. fs.glob just returns the absolute paths of the files, so no hostname should be expected in its results. The problem is in how resolve_pattern reconstructs the full URL: it ignores the HDFS hostname that was part of the original pattern.

From another point of view, resolve_pattern calls fs.glob with hdfs://hostname/user/xxx, but fs.open is later called with hdfs:///user/xxx, which is inconsistent.
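
The inconsistency can be expressed as a small check. This is only a sketch: hosts_match is a hypothetical helper, not part of datasets or fsspec, and it compares just the hostname portion of the two URLs using the standard library.

```python
from urllib.parse import urlsplit


def hosts_match(glob_url, open_url):
    """Hypothetical check: do both URLs point at the same host?"""
    return urlsplit(glob_url).netloc == urlsplit(open_url).netloc


# URL passed to glob vs. URL later passed to open, as described above.
print(hosts_match("hdfs://hostname/user/xxx", "hdfs:///user/xxx"))
# False
print(hosts_match("hdfs://hostname/user/xxx", "hdfs://hostname/user/xxx"))
# True
```

A regression test for this PR could assert something along these lines for every path that comes out of resolve_pattern.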
