rsync across FS with different protocol or credentials is broken #1856

Open
@adm271828

Hi,

I'm completely new to fsspec and was very interested in the rsync utility (even though it is marked as experimental), so I gave it a try (version 2025.5.1).

My first use case is to sync an S3 source to a local directory.

Having a quick look at the source code, I tried something like this:

    from fsspec.generic import rsync

    rsync(
        's3://bucket-name/key',
        './path/to/local/directory',
        inst_kwargs={
            # resolve each filesystem from the per-protocol options below
            'default_method': 'options',
            'storage_options': {
                's3': {'endpoint_url': 'xxxxxx', 'key': 'xxxxxxx', 'secret': 'xxxx'},
            },
        },
    )

The idea is to let the GenericFileSystem instance used by rsync resolve the underlying filesystems behind both the source and the destination.

Unfortunately this didn't work, because GenericFileSystem:

  • doesn't override all the methods rsync uses, namely isdir
  • doesn't resolve URLs correctly

The first problem can be solved by adding the following method to GenericFileSystem (I have absolutely no idea whether this is correct; it simply mimics the way it is done for other methods):

    async def _isdir(
        self,
        url,
        **kwargs,
    ):
        fs = _resolve_fs(url, self.method, storage_options=self.st_opts)
        if fs.async_impl:
            return await fs._isdir(url, **kwargs)
        else:
            return fs.isdir(url, **kwargs)

I addressed the second problem by making sure every call to _resolve_fs passes the storage_options=self.st_opts argument (it is missing in many places).

Then it worked.

This is not very efficient, however, because every call to _resolve_fs creates a new instance. rsync should probably call it only twice: once for the source and once for the destination, when the two do not share the same protocol.
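To illustrate the "resolve once per endpoint" idea, here is a minimal sketch that memoizes the resolved filesystem per protocol. ResolverCache and fake_factory are hypothetical names made up for this example and are not part of fsspec; the factory is a stand-in that only records calls, and a real one could be something like fsspec.filesystem.

```python
from urllib.parse import urlsplit


class ResolverCache:
    """Hypothetical helper: build one filesystem object per protocol and reuse it."""

    def __init__(self, factory, storage_options=None):
        # factory is any callable(protocol, **options) returning a filesystem-like object
        self._factory = factory
        self._options = storage_options or {}
        self._instances = {}

    def resolve(self, url):
        # URLs without a scheme are treated as local paths (assumption of this sketch)
        protocol = urlsplit(url).scheme or "file"
        if protocol not in self._instances:
            options = self._options.get(protocol, {})
            self._instances[protocol] = self._factory(protocol, **options)
        return self._instances[protocol]


calls = []


def fake_factory(protocol, **options):
    # Stand-in for a real filesystem constructor; records each construction
    calls.append(protocol)
    return (protocol, tuple(sorted(options.items())))


cache = ResolverCache(fake_factory, {"s3": {"key": "xxxxxxx"}})
for url in ["s3://bucket-name/key/a", "s3://bucket-name/key/b", "./path/to/local/directory"]:
    cache.resolve(url)
```

With this sketch the factory runs once per protocol no matter how many URLs are resolved, which is the behaviour suggested above for rsync.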

Then came the next question: what if I want to synchronise two S3 directories in different buckets with different credentials? This is also a use case I have.

Since _resolve_fs only uses the protocol to find a filesystem instance, this will not work.

Among the possible solutions I came up with:

  • let _resolve_fs treat the URL as a URI and discriminate on both protocol and authority. At least the bucket name could then be used to select different credentials in inst_kwargs. This would probably also work with other protocols
  • make rsync take source_fs and destination_fs arguments instead of a single fs
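As a rough sketch of the first option, credentials could be looked up by protocol plus authority, falling back to the protocol alone. The select_options function and the 's3://bucket-a' style keys are assumptions made up for this example; this is not how fsspec lays out storage_options today.

```python
from urllib.parse import urlsplit


def select_options(url, storage_options):
    """Hypothetical lookup: prefer 'protocol://authority' keys, else the bare protocol."""
    parts = urlsplit(url)
    protocol = parts.scheme or "file"  # no scheme: assume a local path
    specific = f"{protocol}://{parts.netloc}"
    if parts.netloc and specific in storage_options:
        return storage_options[specific]
    return storage_options.get(protocol, {})


# Placeholder credentials, keyed per bucket with a per-protocol default
options = {
    "s3": {"key": "default-key"},
    "s3://bucket-a": {"key": "key-a"},
    "s3://bucket-b": {"key": "key-b"},
}

src_opts = select_options("s3://bucket-a/data", options)
dst_opts = select_options("s3://bucket-b/backup", options)
other_opts = select_options("s3://some-other-bucket/x", options)
local_opts = select_options("./path/to/local/directory", options)
```

Here the two buckets resolve to different credentials even though they share the s3 protocol, while an unlisted bucket still gets the per-protocol default.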

Hope this helps.

Best regards,

Antoine
