@cathalobrien cathalobrien commented Aug 11, 2025

Description

Copy data with mscp rather than rsync when it is available.

What problem does this change solve?

Faster transfers.

Time to copy the 20GB dataset https://anemoi.ecmwf.int/datasets/aifs-od-an-oper-0001-mars-o96-2016-2024-6h-v1-land-reduced-snow from atos to jupiter:
anemoi utils (mscp): 8m05s (16 threads)
anemoi utils (rsync): 28m36s (16 threads)

So mscp is about 3.5x faster in this example. The speedup varies with file size; in my experience it is greater on higher-resolution datasets.
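
For reference, the 3.5x figure follows directly from the two timings quoted above. A quick check in Python (the helper name `to_seconds` is only illustrative):

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds timing to whole seconds."""
    return 60 * minutes + seconds


mscp_s = to_seconds(8, 5)     # 8m05s with mscp
rsync_s = to_seconds(28, 36)  # 28m36s with rsync

# Ratio of wall-clock times for the same dataset and thread count
speedup = rsync_s / mscp_s
print(f"mscp: {mscp_s}s, rsync: {rsync_s}s, speedup: {speedup:.2f}x")
```

This is a single measurement on one dataset, so it says nothing about how the ratio changes with file size or resolution.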

I also changed the logic so that it passes the entire directory to mscp rather than each file individually, which has a big impact on performance. This is hardcoded for now; ideally the Upload class would have a flag, copy_entire_dataset=True, that passes the whole dataset to mscp in one call. Maybe @floriankrb can point to where this would go.

MscpUpload() checks in its init whether mscp is on the path. Currently it fails if mscp can't be found. Ideally it would fall back to rsync, but I'm not sure where the best place for this logic is.

Currently the progress bar is broken while using mscp, because I don't use transfer_folder.

TODO

  • copy_whole_directory flag
  • fall back to rsync if mscp can't be found

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

Collaborator

@b8raoult b8raoult left a comment

Can we avoid environment variables? Use command line options instead. Also, if we really want to use env vars, we should prefix them with ANEMOI_ so they have a namespace and cannot clash with other tools.

@floriankrb
Member

Using the anemoi config for instance:

from anemoi.utils.config import load_config
tool = load_config().get("utils", {}).get("transfer_tool")

Then add an entry in ~/.config/anemoi/settings.toml

[utils]
transfer_tool = "rsync"

And if you occasionally do want to use an env variable, you can use one to override the config:
ANEMOI_CONFIG_UTILS_TRANSFER_TOOL=mscp anemoi-... ...

@cathalobrien
Author

cathalobrien commented Aug 12, 2025

@floriankrb @b8raoult Env vars have been banished

def _pickTransferTool():
    # Supported tools, in order of preference (dicts preserve insertion order)
    tools = {"mscp": MscpUpload, "rsync": RsyncUpload, "scp": ScpUpload}

    from anemoi.utils.config import load_config

    # Optional user preference from the [utils] section of the anemoi config
    tool = load_config().get("utils", {}).get("transfer_tool", None)
    if tool is not None:
        # Use the configured tool only if it is supported and on the path
        if tool in tools and _isProgramOnPath(tool):
            LOGGER.info(f"Using {tool} to transfer as specified in the anemoi utils config")
            return tools[tool]

    # Otherwise fall back to the first supported tool found on the path
    for tool in tools:
        if _isProgramOnPath(tool):
            LOGGER.info(f"Using {tool} to transfer")
            return tools[tool]
    raise RuntimeError(f"No suitable transfer tool found. Looked for the following: {list(tools)}")


SshUpload = _pickTransferTool()
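
The `_isProgramOnPath` helper called above is not shown in this thread. A minimal sketch, assuming it only needs to report whether an executable with that name can be found on the PATH, could wrap the standard library's `shutil.which`. The name `is_program_on_path` and the optional `path` argument are illustrative and not taken from the PR:

```python
import shutil
from typing import Optional


def is_program_on_path(program: str, path: Optional[str] = None) -> bool:
    """Return True if an executable called `program` is found.

    `path` is an optional search path (same format as the PATH
    environment variable); when None, shutil.which uses PATH itself.
    """
    return shutil.which(program, path=path) is not None


if __name__ == "__main__":
    # Report which of the three tools from the thread are available here
    for name in ("mscp", "rsync", "scp"):
        print(name, is_program_on_path(name))
```

The explicit `path` argument is there only to make the check easy to test without modifying the real environment.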

@cathalobrien cathalobrien changed the title Feat: transfer data with mscp feat: transfer data with mscp Aug 12, 2025
@github-actions github-actions bot added the tests label Aug 13, 2025
@floriankrb
Member

I think there is an actual use case where the user deletes the whole output folder with a legitimate command. We had the issue before.
Maybe something like this: anemoi-utils copy /home/me/data /scratch/data/data --overwrite
On the other hand, keeping this PR waiting is not so good either, and it has been tested with various cases.

We need an additional review from somebody else. @anaprietonem ?

@anaprietonem
Collaborator

anaprietonem commented Aug 25, 2025

To me the PR looks fine, and based on Cathal's description, using 'mscp' could significantly improve transfer times. The current implementation seems okay to me: it clearly logs the tool being used, and having all the tools listed in the code could make it easier for developers to see which options are considered.

I'm not sure I get your point, Florian, about "the user deletes the whole output folder with a legitimate command" and "Maybe something like this anemoi-utils copy /home/me/data /scratch/data/data --overwrite". Is this related to the mscp implementation or to something else?

Comment on lines +192 to +194
if src_basename in target:
    LOGGER.debug(f"Removing {src_basename} from {target}")
    target = target.strip(src_basename)
Member

This part seems unsafe.
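
One concrete reason the snippet is unsafe: `str.strip` treats its argument as a set of characters to remove from both ends of the string, not as a substring. With the values from the mscp debug output in this thread, that turns `.../target` into `.../targe`. The sketch below reproduces this and shows one possible alternative, assuming the intent was to drop a trailing path component equal to the source basename. The helper `drop_trailing_basename` is hypothetical and not part of the PR:

```python
import posixpath

target = "ssh://localhost:/tmp/data/target"
src_basename = "data"

# strip() removes any of the characters 'd', 'a', 't' from both ends,
# so the final 't' of "target" is lost.
stripped = target.strip(src_basename)
print(stripped)


def drop_trailing_basename(target: str, src_basename: str) -> str:
    """Drop the last path component only if it equals src_basename.

    Hypothetical helper; the intended behaviour is an assumption.
    """
    head, tail = posixpath.split(target.rstrip("/"))
    return head if tail == src_basename else target


# The last component is "target", not "data": left unchanged
print(drop_trailing_basename(target, src_basename))
# The last component is "data": removed
print(drop_trailing_basename("ssh://localhost:/tmp/data", src_basename))
```

Comparing whole path components also avoids the `src_basename in target` substring test, which matches here only because the parent directory happens to be called `data`.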

Member

Without mscp on the path (the log shows rsync being selected):

mkdir -p data/data/dir; echo data > data/data/dir/file; echo data > data/data/dir/data; rm -rf data/target; tree data
data
└── data
    └── dir
        ├── data
        └── file

2 directories, 2 files
➜ anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:38:28 INFO Using rsync to transfer
2025-08-25 15:38:28 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:38:28 INFO Uploading data/data to ssh://localhost:/tmp/data/target
2025-08-25 15:38:28 INFO Uploading 2 files (10)
100%|██████████| 10.0/10.0 [00:04<00:00, 2.15B/s]
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    └── dir
        ├── data
        └── file

4 directories, 4 files
➜ anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:38:34 INFO Using rsync to transfer
2025-08-25 15:38:34 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:38:34 INFO Uploading data/data to ssh://localhost:/tmp/data/target
2025-08-25 15:38:34 INFO Uploading 2 files (10)
100%|██████████| 10.0/10.0 [00:04<00:00, 2.10B/s]
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    └── dir
        ├── data
        └── file

With 'mscp' in the path:

mkdir -p data/data/dir; echo data > data/data/dir/file; echo data > data/data/dir/data; rm -rf data/target; tree data
data
└── data
    └── dir
        ├── data
        └── file

2 directories, 2 files
➜ PATH=$PATH:/path/to/mscp/dir anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:37:27 INFO Using mscp to transfer
2025-08-25 15:37:27 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:37:27 DEBUG Removing data from ssh://localhost:/tmp/data/target
2025-08-25 15:37:27 DEBUG Copying data/data to ssh://localhost:/tmp/data/targe with Mscp
2025-08-25 15:37:27 INFO Uploading data/data to ssh://localhost:/tmp/data/targe (4 KiB)
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    └── dir
        ├── data
        └── file

4 directories, 4 files
➜ PATH=$PATH:/path/to/mscp/dir anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:37:32 INFO Using mscp to transfer
2025-08-25 15:37:32 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:37:32 DEBUG Removing data from ssh://localhost:/tmp/data/target
2025-08-25 15:37:32 DEBUG Copying data/data to ssh://localhost:/tmp/data/**targe** with Mscp
2025-08-25 15:37:32 INFO Uploading data/data to ssh://localhost:/tmp/data/**targe** (4 KiB)
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    ├── data
    │   └── dir
    │       ├── data
    │       └── file
    └── dir
        ├── data
        └── file

@anaprietonem
Collaborator

Are we looking to merge this, or did you have a follow-up chat, @cathalobrien / @floriankrb?

@floriankrb
Member

This PR creates an unexpected behaviour on a use case that we want to support (i.e. it adds a bug), so it cannot be merged as it is.

But it is a nice addition and shows actual improvement when using mscp instead of scp.

@HCookie HCookie moved this from To be triaged to On Pause in Anemoi-dev Nov 17, 2025
@floriankrb floriankrb marked this pull request as draft December 8, 2025 20:16