feat: transfer data with mscp #203
base: main
for more information, see https://pre-commit.ci
Can we avoid environment variables? Use command line options instead. Also, if we really want to use env vars, we should prefix them with ANEMOI_ so they have a namespace and cannot clash with other tools.
This could use the anemoi config, for instance, with an entry in ~/.config/anemoi/settings.toml. And if you sometimes really want to use an env variable, it can override the config.
@floriankrb @b8raoult Env vars have been banished:

def _pickTransferTool():
    tools = {"mscp": MscpUpload, "rsync": RsyncUpload, "scp": ScpUpload}
    from anemoi.utils.config import load_config

    tool = load_config().get("utils", {}).get("transfer_tool", None)
    if tool is not None:
        # Use the tool named in the config if it is known and can be found
        if tool in tools and _isProgramOnPath(tool):
            LOGGER.info(f"Using {tool} to transfer as specified in the anemoi utils config")
            return tools[tool]

    # Otherwise loop through the tools in order until one is found
    for tool in tools:
        if _isProgramOnPath(tool):
            LOGGER.info(f"Using {tool} to transfer")
            return tools[tool]
    raise RuntimeError(f"No suitable transfer tool found. Looked for the following: {list(tools)}")


SshUpload = _pickTransferTool()
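The `_isProgramOnPath` helper used in the snippet above is not shown in this conversation. A minimal sketch of the same selection logic, assuming the helper behaves like `shutil.which`, could look like the following. The names `is_program_on_path` and `pick_tool` are illustrative and not part of anemoi-utils; the availability check is injectable so the fallback order can be exercised without the tools installed.

```python
import shutil


def is_program_on_path(program: str) -> bool:
    """Assumed equivalent of _isProgramOnPath: True if `program` resolves on PATH."""
    return shutil.which(program) is not None


def pick_tool(candidates, preferred=None, is_available=is_program_on_path):
    """Return `preferred` if it is a known, available tool, else the first available candidate."""
    if preferred in candidates and is_available(preferred):
        return preferred
    # Fall back to the candidates in the order given
    for name in candidates:
        if is_available(name):
            return name
    raise RuntimeError(f"No suitable transfer tool found. Looked for: {list(candidates)}")


# Example: pretend only rsync and scp are installed
installed = {"rsync", "scp"}
choice = pick_tool(["mscp", "rsync", "scp"], is_available=lambda name: name in installed)
print(choice)  # rsync
```

Judging by the `load_config().get("utils", {}).get("transfer_tool")` lookup, the preferred tool would presumably be read from a `transfer_tool` key under a `utils` section of the settings file.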
I think there is an actual use case where the user deletes the whole output folder with a legitimate command. We had this issue before. We need an additional review from somebody else. @anaprietonem ?
To me the PR looks fine and, based on the description from Cathal, using 'mscp' could significantly improve transfer times. The current implementation seems okay to me, as it clearly logs the tool being used, and having all the tools listed in the code could make it easier for developers to know which options are being considered. Not sure I get your point, Florian, about
if src_basename in target:
    LOGGER.debug(f"Removing {src_basename} from {target}")
    target = target.strip(src_basename)
This part seems unsafe.
With rsync:
mkdir -p data/data/dir; echo data > data/data/dir/file; echo data > data/data/dir/data; rm -rf data/target; tree data
data
└── data
    └── dir
        ├── data
        └── file
2 directories, 2 files
➜ anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:38:28 INFO Using rsync to transfer
2025-08-25 15:38:28 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:38:28 INFO Uploading data/data to ssh://localhost:/tmp/data/target
2025-08-25 15:38:28 INFO Uploading 2 files (10)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.0/10.0 [00:04<00:00, 2.15B/s]
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    └── dir
        ├── data
        └── file
4 directories, 4 files
➜ anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:38:34 INFO Using rsync to transfer
2025-08-25 15:38:34 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:38:34 INFO Uploading data/data to ssh://localhost:/tmp/data/target
2025-08-25 15:38:34 INFO Uploading 2 files (10)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.0/10.0 [00:04<00:00, 2.10B/s]
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    └── dir
        ├── data
        └── file
With 'mscp' in the path:
mkdir -p data/data/dir; echo data > data/data/dir/file; echo data > data/data/dir/data; rm -rf data/target; tree data
data
└── data
    └── dir
        ├── data
        └── file
2 directories, 2 files
➜ PATH=$PATH:/path/to/mscp/dir anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:37:27 INFO Using mscp to transfer
2025-08-25 15:37:27 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:37:27 DEBUG Removing data from ssh://localhost:/tmp/data/target
2025-08-25 15:37:27 DEBUG Copying data/data to ssh://localhost:/tmp/data/targe with Mscp
2025-08-25 15:37:27 INFO Uploading data/data to ssh://localhost:/tmp/data/targe (4 KiB)
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    └── dir
        ├── data
        └── file
4 directories, 4 files
➜ PATH=$PATH:/path/to/mscp/dir anemoi-utils --debug transfer --source data/data --target ssh://localhost:/tmp/data/target --overwrite; tree data
2025-08-25 15:37:32 INFO Using mscp to transfer
2025-08-25 15:37:32 INFO Deleting ssh://localhost:/tmp/data/target
2025-08-25 15:37:32 DEBUG Removing data from ssh://localhost:/tmp/data/target
2025-08-25 15:37:32 DEBUG Copying data/data to ssh://localhost:/tmp/data/**targe** with Mscp
2025-08-25 15:37:32 INFO Uploading data/data to ssh://localhost:/tmp/data/**targe** (4 KiB)
data
├── data
│   └── dir
│       ├── data
│       └── file
└── target
    ├── data
    │   └── dir
    │       ├── data
    │       └── file
    └── dir
        ├── data
        └── file
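The `targe` in the mscp log above is what `str.strip` produces here: it treats its argument as a set of characters to remove from both ends, not as a substring. The sketch below reproduces that with the values from the log, and shows one possible alternative that removes the source basename only when it is the final path component. `drop_trailing_component` is a hypothetical helper for illustration, not code from this PR.

```python
import posixpath

target = "ssh://localhost:/tmp/data/target"
src_basename = "data"

# str.strip removes any of the characters 'd', 'a', 't' from both ends,
# so the trailing 't' of "target" is lost.
stripped = target.strip(src_basename)
print(stripped)  # ssh://localhost:/tmp/data/targe


def drop_trailing_component(target: str, name: str) -> str:
    """Hypothetical helper: drop `name` only if it is the last path component."""
    head, tail = posixpath.split(target.rstrip("/"))
    return head if tail == name else target


# "data" is not the last component here, so the target is left untouched
print(drop_trailing_component(target, src_basename))
# Here it is the last component, so it is removed
print(drop_trailing_component("ssh://localhost:/tmp/out/data", src_basename))
```

Note that the `if src_basename in target` check is also a substring test, so it matches `/tmp/data/target` even though no path component of the target needs removing.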
Are we looking to merge this or did you have any follow-up chat @cathalobrien / @floriankrb ?
This PR creates an unexpected behaviour on a use case that we want to support (i.e. it adds a bug), so it cannot be merged as it is. But it is a nice addition and shows actual improvement when using mscp instead of scp.
Description
Copy data with mscp rather than rsync when it is available.
What problem does this change solve?
Faster transfers.
Time to copy the 20GB dataset from atos to jupiter (https://anemoi.ecmwf.int/datasets/aifs-od-an-oper-0001-mars-o96-2016-2024-6h-v1-land-reduced-snow):
anemoi utils (mscp): 8m05s (16 threads)
anemoi utils (rsync): 28m36s (16 threads)
So about 3.5x faster on this example. This varies with file size; in my experience the speedup is greater on higher-resolution datasets.
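For reference, the speedup figure can be checked from the two timings quoted above. The throughput lines are only indicative, since they assume that "20GB" means 20 * 10**9 bytes, which the description does not state.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'XmYYs' timing to seconds."""
    return 60 * minutes + seconds


mscp_s = to_seconds(8, 5)     # 8m05s  -> 485 s
rsync_s = to_seconds(28, 36)  # 28m36s -> 1716 s

speedup = rsync_s / mscp_s
print(f"speedup: {speedup:.2f}x")  # speedup: 3.54x

# Assumption: the dataset size is exactly 20 * 10**9 bytes
size_bytes = 20 * 10**9
for name, secs in (("mscp", mscp_s), ("rsync", rsync_s)):
    print(f"{name}: {size_bytes / secs / 1e6:.1f} MB/s")
```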
I also changed the logic so that it passes the entire directory to mscp rather than each file individually. This has a big impact on performance. I hardcoded it, but ideally we'd have a flag in the Upload class, copy_entire_dataset=True, which would pass the whole dataset to mscp in one go. Maybe @floriankrb can point to where this would go.
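One way to picture the proposed flag is as a switch between planning one transfer for the whole directory and planning one transfer per file. The sketch below is purely illustrative: `plan_transfers` is a hypothetical function, not part of the Upload class, and it only builds (source, destination) pairs without invoking any transfer tool.

```python
import os


def plan_transfers(source: str, target: str, copy_entire_dataset: bool = False):
    """Hypothetical planner: return (src, dst) pairs to hand to a transfer tool."""
    if copy_entire_dataset:
        # A single transfer covering the whole directory
        return [(source, target)]
    pairs = []
    for root, _, files in os.walk(source):
        for name in sorted(files):
            src = os.path.join(root, name)
            # Keep the path relative to the source and use '/' on the remote side
            rel = os.path.relpath(src, source).replace(os.sep, "/")
            pairs.append((src, f"{target}/{rel}"))
    return pairs
```

With `copy_entire_dataset=True` the tool is invoked once and can parallelise internally, whereas the per-file plan produces one entry per file.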
MscpUpload() checks in its init whether mscp is on the path. Currently it fails if mscp can't be found. Ideally it would fall back to rsync, but I'm not sure where the best place for this logic is.
Currently the progress bar is broken while using mscp, because I don't use transfer_folder.
TODO
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.