Utils grids cache looks for a non-existent grid, crashes #206

@cathalobrien

Description

What happened?

I am running Anemoi training and building a graph, reading grids from a custom location. It is a multi-GPU job: one process makes it through and the others crash. I am also running multiple jobs at the same time, each trying to build a graph.

anemoi-utils looks for /home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz. I can share my full config on request, but I have checked and this string does not come from my end. It looks like a temporary file that got deleted.

Below is the full stack trace. Three processes crash, and it looks like their traces got interleaved, so it is quite long. Sorry.

2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:41 INFO Reading the dataset from /home/mlx/ai-ml/datasets/aifs-ea-an-oper-0001-mars-n320-2020-2020-6h-v4.zarr.
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
2025-08-13 09:03:51 WARNING Loading grids from custom user path /lus/h2resw01/hpcperm/naco/raps-ac/build/grids/grid-o96.npz
Traceback (most recent call last):
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/venv/bin/anemoi-training", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/__main__.py", line 23, in main
    sys.exit(main())
             ^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/__main__.py", line 23, in main
    _run_hydra(
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    cli_main(__version__, __doc__, COMMANDS)
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/cli.py", line 232, in cli_main
    _run_app(
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    cmd.run(args, unknown)
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 33, in run
    cli_main(__version__, __doc__, COMMANDS)
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/cli.py", line 232, in cli_main
    run_and_report(
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    cmd.run(args, unknown)
    raise ex
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 33, in run
    main()
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 44, in main
    return func()
           ^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    main()
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/commands/profiler.py", line 44, in main
    anemoi_profile()
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    _run_hydra(
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    anemoi_profile()
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/profiler.py", line 308, in datamodule
    self._log_information()
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 403, in _log_information
    datamodule = super().datamodule
                 ^^^^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    num_fc_features = len(self.datamodule.ds_train.data.variables) - len(self.config.data.forcing)
                          ^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    num_fc_features = len(self.datamodule.ds_train.data.variables) - len(self.config.data.forcing)
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
                          ^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/profiler.py", line 308, in datamodule
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/profiler.py", line 308, in datamodule
    self.graph_data,
    ^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    datamodule = super().datamodule
                 ^^^^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 160, in graph_data
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
    datamodule = super().datamodule
                 ^^^^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
    datamodule = super().datamodule
                 ^^^^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 104, in datamodule
    self.graph_data,
    ^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    return GraphCreator(config=graph_config).create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 180, in create
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 160, in graph_data
    self.graph_data,
    ^^^^^^^^^^^^^^^
  File "/usr/local/apps/python3/3.12.9-01/lib/python3.12/functools.py", line 998, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/lus/h2resw01/hpcperm/naco/raps-dev/ag-build/sources/anemoi-core/training/src/anemoi/training/train/train.py", line 160, in graph_data
    return GraphCreator(config=graph_config).create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 180, in create
    return GraphCreator(config=graph_config).create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 180, in create
    graph = self.update_graph(graph)
    graph = self.update_graph(graph)
    graph = self.update_graph(graph)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 80, in update_graph
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 80, in update_graph
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/create.py", line 80, in update_graph
    graph = instantiate(nodes_cfg.node_builder, name=nodes_name).update_graph(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    graph = instantiate(nodes_cfg.node_builder, name=nodes_name).update_graph(
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 139, in update_graph
    graph = instantiate(nodes_cfg.node_builder, name=nodes_name).update_graph(
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 139, in update_graph
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 139, in update_graph
    graph = self.register_nodes(graph)
    graph = self.register_nodes(graph)
    graph = self.register_nodes(graph)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 60, in register_nodes
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 60, in register_nodes
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/base.py", line 60, in register_nodes
    graph[self.name].x = self.get_coordinates().to(torch.float32)
    graph[self.name].x = self.get_coordinates().to(torch.float32)
                         ^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/from_reduced_gaussian.py", line 65, in get_coordinates
    graph[self.name].x = self.get_coordinates().to(torch.float32)
                         ^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/from_reduced_gaussian.py", line 65, in get_coordinates
                         ^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/graphs/nodes/builders/from_reduced_gaussian.py", line 65, in get_coordinates
    grid_data = grids(self.grid)
    grid_data = grids(self.grid)
                ^^^^^^^^^^^^^^^^
    grid_data = grids(self.grid)
                ^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/deprecation.py", line 260, in _inner
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/deprecation.py", line 260, in _inner
                ^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/deprecation.py", line 260, in _inner
    return function(*args, **kwargs)
    return function(*args, **kwargs)
    return function(*args, **kwargs)
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/grids.py", line 215, in grids
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/grids.py", line 215, in grids
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/grids.py", line 215, in grids
    data = _grids(name)
    data = _grids(name)
    data = _grids(name)
           ^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 94, in wrapped
           ^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 94, in wrapped
           ^^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 94, in wrapped
    return self.cache(
    return self.cache(
           ^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 140, in cache
    return self.cache(
           ^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 140, in cache
           ^^^^^^^^^^^
  File "/perm/naco/venvs/raps-ag/lib/python3.12/site-packages/anemoi/utils/caching.py", line 140, in cache
    os.rename(temp_filename, filename)
    os.rename(temp_filename, filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz.tmp.npz' -> '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz'
FileNotFoundError: [Errno 2] No such file or directory: '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz.tmp.npz' -> '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz'
    os.rename(temp_filename, filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz.tmp.npz' -> '/home/naco/.cache/anemoi/grids/259342e2a568a1a5698109531c4927aa.npz'
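
The last frames show `os.rename(temp_filename, filename)` failing, with a temporary name that appears to be derived from the target (`<target>.tmp.npz`). If every process really uses that same temporary path, one way to avoid the collision might be to give each writer its own temporary file and publish it with `os.replace`. The sketch below is not the anemoi-utils code: `atomic_write` is a hypothetical helper, and it assumes the temporary file sits in the same directory as the target so that the rename stays on one filesystem.

```python
import os
import tempfile


def atomic_write(filename, data):
    """Write bytes to `filename` through a temp file no other writer shares.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(filename))
    os.makedirs(directory, exist_ok=True)

    # mkstemp creates a uniquely named file, so concurrent writers
    # never open (or rename away) each other's temporary file.
    fd, temp_filename = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace overwrites an existing target instead of failing.
        os.replace(temp_filename, filename)
    except BaseException:
        # Do not leave a stray temp file behind if anything went wrong.
        if os.path.exists(temp_filename):
            os.remove(temp_filename)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        target = os.path.join(cache_dir, "grid.npz")
        atomic_write(target, b"first writer")
        atomic_write(target, b"second writer")  # no error, target replaced
        print(os.listdir(cache_dir))
```

With this pattern the last writer wins, which should be harmless if all processes compute the same cached content. Rename semantics on network filesystems can differ, so this would need checking on the actual cache location.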

What are the steps to reproduce the bug?

Not sure. Try having multiple jobs each build a graph at the same time.
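
A minimal sketch of the suspected race, independent of Anemoi. It assumes (from the traceback, not from reading the caching code) that each worker writes to the same `<target>.tmp.npz` path and then calls `os.rename`. Threads stand in for the separate processes, and `racy_cache_write` and `run_race` are hypothetical names. The barrier forces the unlucky ordering: every worker finishes writing before any of them renames.

```python
import os
import tempfile
import threading


def racy_cache_write(filename, barrier, results, index):
    """Mimic a cache write that uses one shared temporary path."""
    temp_filename = filename + ".tmp.npz"  # identical for every worker
    with open(temp_filename, "wb") as f:
        f.write(b"grid data")

    # Wait until all workers have written, then race on the rename.
    barrier.wait()
    try:
        os.rename(temp_filename, filename)
        results[index] = "ok"
    except FileNotFoundError:
        # Another worker already renamed the shared temp file away.
        results[index] = "FileNotFoundError"


def run_race(n_workers=4):
    """Run n_workers racing writers and return each worker's outcome."""
    with tempfile.TemporaryDirectory() as cache_dir:
        filename = os.path.join(cache_dir, "grid.npz")
        barrier = threading.Barrier(n_workers)
        results = [None] * n_workers
        threads = [
            threading.Thread(
                target=racy_cache_write,
                args=(filename, barrier, results, i),
            )
            for i in range(n_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results


if __name__ == "__main__":
    print(run_race())
```

On a POSIX filesystem exactly one worker should report `ok` and the rest `FileNotFoundError`, which matches the pattern of one process surviving and three crashing in a four-process job.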

Version

0.4.28

Platform (OS and architecture)

Linux ag6-400 5.14.0-427.42.1.el9_4.aarch64+64k #1 SMP PREEMPT_DYNAMIC Fri Oct 18 18:54:50 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux (but observed on Atos AC too)

Relevant log output

See the stack trace above.

Accompanying data

No response

Organisation

No response

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Projects status: To be triaged
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests