Description
What happened?
Anemoi training crashes while building a graph, reading grids from a custom location. I am running a multi-GPU job: one process makes it through and the others crash. I am also running multiple jobs at the same time, each trying to build a graph.
anemoi-utils tries to rename /home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz.tmp.npz to /home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz and fails because the source file no longer exists. I can share my full config on request, but I have checked and this string does not come from my end. It looks like a temporary file that was deleted.
Below is the full stack trace. Three processes crash, and it looks like their traces got interleaved, so it is quite long. Sorry.
2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
Traceback (most recent call last):
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/venv/bin/anemoi-training", line 8, in <module>
sys.exit(main())
^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/__main__.py", line 23, in main
sys.exit(main())
^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/__main__.py", line 23, in main
_run_hydra(
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
cli_main(__version__, __doc__, COMMANDS)
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/cli.py", line 232, in cli_main
_run_app(
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
cmd.run(args, unknown)
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 33, in run
cli_main(__version__, __doc__, COMMANDS)
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/cli.py", line 232, in cli_main
run_and_report(
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
cmd.run(args, unknown)
raise ex
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 33, in run
main()
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 44, in main
return func()
^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
main()
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 44, in main
anemoi_profile()
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
_run_hydra(
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
anemoi_profile()
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/profiler.py", line 308, in datamodule
self._log_information()
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 403, in _log_information
datamodule = super().datamodule
^^^^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
num_fc_features = len(self.datamodule.ds_train.data.variables) - len(self.config.data.forcing)
^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
num_fc_features = len(self.datamodule.ds_train.data.variables) - len(self.config.data.forcing)
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/profiler.py", line 308, in datamodule
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/profiler.py", line 308, in datamodule
self.graph_data,
^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
datamodule = super().datamodule
^^^^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 160, in graph_data
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
datamodule = super().datamodule
^^^^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
datamodule = super().datamodule
^^^^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
self.graph_data,
^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
return GraphCreator(config=graph_config).create(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 180, in create
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 160, in graph_data
self.graph_data,
^^^^^^^^^^^^^^^
File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 160, in graph_data
return GraphCreator(config=graph_config).create(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 180, in create
return GraphCreator(config=graph_config).create(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 180, in create
graph = self.update_graph(graph)
graph = self.update_graph(graph)
graph = self.update_graph(graph)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 80, in update_graph
^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 80, in update_graph
^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 80, in update_graph
graph = instantiate(nodes_cfg.node_builder, name=nodes_name).update_graph(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
graph = instantiate(nodes_cfg.node_builder, name=nodes_name).update_graph(
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 139, in update_graph
graph = instantiate(nodes_cfg.node_builder, name=nodes_name).update_graph(
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 139, in update_graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 139, in update_graph
graph = self.register_nodes(graph)
graph = self.register_nodes(graph)
graph = self.register_nodes(graph)
^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 60, in register_nodes
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 60, in register_nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 60, in register_nodes
graph[self.name].x = self.get_coordinates().to(torch.float32)
graph[self.name].x = self.get_coordinates().to(torch.float32)
^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/from_reduced_gaussian.py", line 65, in get_coordinates
graph[self.name].x = self.get_coordinates().to(torch.float32)
^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/from_reduced_gaussian.py", line 65, in get_coordinates
^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/from_reduced_gaussian.py", line 65, in get_coordinates
grid_data = grids(self.grid)
grid_data = grids(self.grid)
^^^^^^^^^^^^^^^^
grid_data = grids(self.grid)
^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/deprecation.py", line 260, in _inner
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/deprecation.py", line 260, in _inner
^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/deprecation.py", line 260, in _inner
return function(*args, **kwargs)
return function(*args, **kwargs)
return function(*args, **kwargs)
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/grids.py", line 215, in grids
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/grids.py", line 215, in grids
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/grids.py", line 215, in grids
data = _grids(name)
data = _grids(name)
data = _grids(name)
^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 94, in wrapped
^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 94, in wrapped
^^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 94, in wrapped
return self.cache(
return self.cache(
^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 140, in cache
return self.cache(
^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 140, in cache
^^^^^^^^^^^
File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 140, in cache
os.rename(temp_filename, filename)
os.rename(temp_filename, filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz.tmp.npz' -> '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz'
FileNotFoundError: [Errno 2] No such file or directory: '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz.tmp.npz' -> '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz'
os.rename(temp_filename, filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz.tmp.npz' -> '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz'
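Reading the last frames, one possible explanation is that every process writes to the same temporary name (the cache file name plus `.tmp.npz`), so whichever process renames first removes the file the others still expect. This is a guess from the traceback, not something confirmed in the anemoi-utils source. A minimal sketch of a write that avoids a shared temporary name, assuming the cache payload is plain bytes; `write_cache_atomically` is a hypothetical helper, not an anemoi-utils function:

```python
import os
import tempfile


def write_cache_atomically(filename, payload):
    """Hypothetical helper: write `payload` (bytes) to `filename` without a shared temp name."""
    directory = os.path.dirname(os.path.abspath(filename))
    os.makedirs(directory, exist_ok=True)

    # mkstemp creates a uniquely named file, so concurrent writers never
    # share (or delete) each other's temporary file.
    fd, temp_filename = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        # os.replace overwrites an existing destination; the last writer wins.
        # The temp file is in the same directory, so no cross-filesystem move.
        os.replace(temp_filename, filename)
    except BaseException:
        # Clean up the private temp file if anything went wrong before the move.
        if os.path.exists(temp_filename):
            os.remove(temp_filename)
        raise
    return filename
```

With this pattern several processes may still compute the same entry redundantly, but each one moves its own temporary file into place, so none of them should hit a missing source on rename.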
What are the steps to reproduce the bug?
Not sure. Try running multiple jobs at the same time, each trying to build a graph.
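I do not have a reliable reproducer with real jobs, but the suspected interleaving can be replayed in a single process. The sketch below is an assumption about what happens, not code taken from anemoi-utils: two writers (labelled A and B) share one temporary path built with the same `.tmp.npz` suffix seen in the error message, and the second `os.rename` fails once the first has moved the file. `simulate_shared_temp_race` and the `grid.npz` name are made up for illustration.

```python
import os
import tempfile


def simulate_shared_temp_race(cache_dir):
    """Replay the suspected interleaving of two writers sharing one temp path."""
    filename = os.path.join(cache_dir, "grid.npz")  # hypothetical cache entry
    temp_filename = filename + ".tmp.npz"  # same suffix pattern as in the traceback

    # Writers A and B both write to the same temporary path.
    for payload in (b"written by A", b"written by B"):
        with open(temp_filename, "wb") as f:
            f.write(payload)

    # Writer A renames first and succeeds.
    os.rename(temp_filename, filename)

    # Writer B renames second: the temporary file is already gone.
    try:
        os.rename(temp_filename, filename)
    except FileNotFoundError as error:
        return error
    return None


with tempfile.TemporaryDirectory() as cache_dir:
    error = simulate_shared_temp_race(cache_dir)
    print(type(error).__name__)
```

If this is what happens, it would match the symptom above: one process gets through and the others crash on the rename.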
Version
0.4.28
Platform (OS and architecture)
Linux ag6-400 5.14.0-427.42.1.el9_4.aarch64+64k #1 SMP PREEMPT_DYNAMIC Fri Oct 18 18:54:50 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux (but observed on Atos AC too)
Relevant log output
above
Accompanying data
No response
Organisation
No response