Conversation

@clessig (Collaborator) commented Jan 28, 2026

Description

Re-enabling robust integration test

Issue Number

Closes #1712

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and listed the run_id(s) in a comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions github-actions bot added the infra Issues related to infrastructure label Jan 28, 2026
@clessig (Collaborator, Author) commented Jan 28, 2026

Evaluation currently breaks with (CC @SavvasMel @iluise):

Opening zipstore, read-only: True
FAILEDend fixture


====================================================== FAILURES =======================================================
__________________________________ test_train_multi_stream[test_multi_stream_77fb7] ___________________________________

setup = None, test_run_id = 'test_multi_stream_77fb7'

    @pytest.mark.parametrize("test_run_id", ["test_multi_stream_" + commit_hash])
    def test_train_multi_stream(setup, test_run_id):
        """Test training with multiple streams including gridded and observation data."""
        logger.info(f"test_train_multi_stream with run_id {test_run_id} {WEATHERGEN_HOME}")
    
        train_with_args(
            f"--base-config={WEATHERGEN_HOME}/integration_tests/small_multi_stream.yaml".split()
            + [
                "--run-id",
                test_run_id,
            ],
            f"{WEATHERGEN_HOME}/integration_tests/streams_multi/",
        )
    
        infer_multi_stream(test_run_id)
>       evaluate_multi_stream_results(test_run_id)

integration_tests/small_multi_stream_test.py:71: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
integration_tests/small_multi_stream_test.py:159: in evaluate_multi_stream_results
    evaluate_from_config(cfg, None, None)
packages/evaluate/src/weathergen/evaluate/run_evaluation.py:334: in evaluate_from_config
    results = [_process_stream(**task) for task in tasks]
packages/evaluate/src/weathergen/evaluate/run_evaluation.py:230: in _process_stream
    plot_data(reader, stream, global_plotting_opts)
packages/evaluate/src/weathergen/evaluate/utils/utils.py:399: in plot_data
    maps_config = common_ranges(
packages/evaluate/src/weathergen/evaluate/utils/utils.py:592: in common_ranges
    list_max = calc_bounds(data_tars, data_preds, var, "max")
packages/evaluate/src/weathergen/evaluate/utils/utils.py:648: in calc_bounds
    calc_val(da_tars.where(da_tars.channel == var, drop=True), bound),
packages/evaluate/src/weathergen/evaluate/utils/utils.py:616: in calc_val
    return x.max(dim=("ipoint")).values
.venv/lib/python3.12/site-packages/xarray/core/_aggregations.py:2820: in max
    return self.reduce(
.venv/lib/python3.12/site-packages/xarray/core/dataarray.py:3857: in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
.venv/lib/python3.12/site-packages/xarray/core/variable.py:1681: in reduce
    result = super().reduce(
.venv/lib/python3.12/site-packages/xarray/namedarray/core.py:920: in reduce
    data = func(self.data, axis=axis, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

values = dask.array<where, shape=(2, 40320, 1), dtype=float32, chunksize=(1, 672, 1), chunktype=numpy.ndarray>
axis = 1, skipna = None, kwargs = {}
xp = <module 'numpy' from '/users/lessig/santis/WeatherGenerator/.venv/lib/python3.12/site-packages/numpy/__init__.py'>
func = None

    def f(values, axis=None, skipna=None, **kwargs):
        if kwargs.pop("out", None) is not None:
            raise TypeError(f"`out` is not valid for {name}")
    
        # The data is invariant in the case of 0d data, so do not
        # change the data (and dtype)
        # See https://github.com/pydata/xarray/issues/4885
        if invariant_0d and axis == ():
            return values
    
        xp = get_array_namespace(values)
        values = asarray(values, xp=xp)
    
        if coerce_strings and dtypes.is_string(values.dtype):
            values = astype(values, object)
    
        func = None
        if skipna or (
            skipna is None
            and (
                dtypes.isdtype(
                    values.dtype, ("complex floating", "real floating"), xp=xp
                )
                or dtypes.is_object(values.dtype)
            )
        ):
>           from xarray.computation import nanops
E           ImportError: cannot import name 'nanops' from 'xarray.computation' (/users/lessig/santis/WeatherGenerator/.venv/lib/python3.12/site-packages/xarray/computation/__init__.py)

.venv/lib/python3.12/site-packages/xarray/core/duck_array_ops.py:519: ImportError
------------------------------------------------- Captured log setup --------------------------------------------------
INFO     small_multi_stream_test:small_multi_stream_test.py:49 setup fixture with test_multi_stream_77fb7
-------------------------------------------------- Captured log call --------------------------------------------------
INFO     small_multi_stream_test:small_multi_stream_test.py:59 test_train_multi_stream with run_id test_multi_stream_77fb7 /users/lessig/santis/WeatherGenerator
INFO     weathergen.common.config:config.py:505 Loading private config from platform-env.py: /users/lessig/santis/WeatherGenerator-private/hpc/platform-env.py.
INFO     weathergen.common.config:config.py:524 Detected HPC: santis.
INFO     weathergen.common.config:config.py:530 Loading private config from platform-env.py output: /users/lessig/santis/WeatherGenerator-private/hpc/santis/config/paths.yml.
INFO     weathergen.common.config:config.py:481 Using existing config as overwrite: {}.
INFO     weathergen.common.config:config.py:550 Loading specified base config from file: /users/lessig/santis/WeatherGenerator/integration_tests/small_multi_stream.yaml.
INFO     weathergen.common.config:config.py:446 Using assigned run_id: test_multi_stream_77fb7. If you manually selected this run_id, this is an error.
INFO     weathergen.common.config:config.py:505 Loading private config from platform-env.py: /users/lessig/santis/WeatherGenerator-private/hpc/platform-env.py.
INFO     weathergen.common.config:config.py:524 Detected HPC: santis.
INFO     weathergen.common.config:config.py:530 Loading private config from platform-env.py output: /users/lessig/santis/WeatherGenerator-private/hpc/santis/config/paths.yml.
INFO     weathergen.common.config:config.py:505 Loading private config from platform-env.py: /users/lessig/santis/WeatherGenerator-private/hpc/platform-env.py.
INFO     weathergen.common.config:config.py:524 Detected HPC: santis.
INFO     weathergen.common.config:config.py:530 Loading private config from platform-env.py output: /users/lessig/santis/WeatherGenerator-private/hpc/santis/config/paths.yml.
------------------------------------------------ Captured log teardown ------------------------------------------------
INFO     small_multi_stream_test:small_multi_stream_test.py:53 end fixture
================================================== warnings summary ===================================================
integration_tests/small_multi_stream_test.py: 34 warnings
  /users/lessig/.local/share/uv/python/cpython-3.12.11-linux-aarch64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=249480) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============================================== short test summary info ===============================================
FAILED integration_tests/small_multi_stream_test.py::test_train_multi_stream[test_multi_stream_77fb7] - ImportError: cannot import name 'nanops' from 'xarray.computation' (/users/lessig/santis/WeatherGenerator/.venv/li...
===================================== 1 failed, 34 warnings in 582.15s (0:09:42) ======================================
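The ImportError is raised inside xarray itself: duck_array_ops.py attempts "from xarray.computation import nanops" and fails, which suggests a problem with the xarray installation in .venv (e.g. a stale or partially upgraded package) rather than with the evaluation code; that diagnosis is an assumption, not verified. Below is a minimal reproduction sketch of the same call path, independent of the test suite; the array shape and dimension names are taken from the traceback, everything else is illustrative.

# Hypothetical reproduction sketch, not part of the test suite.
import dask.array as da
import numpy as np
import xarray as xr

values = xr.DataArray(
    da.from_array(np.zeros((2, 40320, 1), dtype="float32"), chunks=(1, 672, 1)),
    dims=("sample", "ipoint", "channel"),
)
# For floating-point data, skipna defaults to True, so the reduction is dispatched
# through xarray.computation.nanops; with a broken xarray install this import fails
# with the error shown in the traceback above.
print(values.max(dim="ipoint").values)

If the cause is indeed the xarray install, recreating the virtual environment would be the first thing to try.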

@clessig (Collaborator, Author) commented Jan 28, 2026

The parameters also need to be re-tuned. ERA5 converged too slowly, but restricting to a very limited number of channels, as before, also led to unreliable convergence for NPP-ATMS.

@grassesi (Contributor) commented:

> The parameters also need to be re-tuned. ERA5 converged too slowly, but restricting to a very limited number of channels, as before, also led to unreliable convergence for NPP-ATMS.

I want to have a more general discussion about the goals of our testing, including Tim as well if possible. I think we should separate testing the control and data flow from testing for convergence:

  • If I am doing development, I want a quick and convenient way to check that nothing breaks (control/data-flow wise). For this it is useful to have a very broad/big model configuration (to cover more code) that runs for at most 5 minutes on all platforms.
  • Such a broad model configuration then requires longer training if we also test for convergence, which defeats the purpose of having something that can be run often, quickly and conveniently. (A sketch of how the two kinds of tests could be separated follows below.)
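One way such a split could look, sketched with pytest markers; the marker names, test names, and the run_short_training helper are hypothetical and not taken from the repository.

# Hypothetical sketch: splitting fast smoke tests from convergence tests with pytest
# markers ("smoke"/"convergence" are made-up names; they would be registered in the
# pytest configuration, e.g. under [tool.pytest.ini_options] markers).
import pytest


def run_short_training(epochs: int) -> list[float]:
    # Placeholder for the existing training entry point (e.g. train_with_args);
    # in this sketch it just returns per-epoch losses.
    return [1.0 / (epoch + 1) for epoch in range(epochs)]


@pytest.mark.smoke
def test_multi_stream_smoke():
    # Fast control/data-flow check: broad config, one epoch, no convergence assertion.
    run_short_training(epochs=1)


@pytest.mark.convergence
def test_multi_stream_convergence():
    # Slower check that the loss actually starts to decrease.
    losses = run_short_training(epochs=5)
    assert losses[-1] < losses[0]

Day-to-day development could then run only pytest -m smoke, leaving -m convergence to scheduled CI runs.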

@clessig (Collaborator, Author) commented Jan 28, 2026

> > The parameters also need to be re-tuned. ERA5 converged too slowly, but restricting to a very limited number of channels, as before, also led to unreliable convergence for NPP-ATMS.
>
> I want to have a more general discussion about the goals of our testing, including Tim as well if possible. I think we should separate testing the control and data flow from testing for convergence:
>
>   • If I am doing development, I want a quick and convenient way to check that nothing breaks (control/data-flow wise). For this it is useful to have a very broad/big model configuration (to cover more code) that runs for at most 5 minutes on all platforms.
>   • Such a broad model configuration then requires longer training if we also test for convergence, which defeats the purpose of having something that can be run often, quickly and conveniently.

For ERA5 we had a config that tested convergence (i.e. that training starts to converge, which is enough) in less than 5 minutes on all platforms. So I still don't fully see why we should have multiple tests.
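For reference, "starts to converge" can be checked cheaply; below is a hypothetical sketch of such an assertion (the helper name, threshold, and example losses are assumptions, not from the repository).

def assert_starts_to_converge(losses: list[float], factor: float = 0.8) -> None:
    """Check that the training loss has started to decrease during a short run."""
    # We only require a clear downward trend, not a fully converged model,
    # so the test stays well within a 5-minute budget.
    assert losses[-1] < factor * losses[0], (
        f"loss did not start to decrease: {losses[0]:.3f} -> {losses[-1]:.3f}"
    )

# Example usage with made-up per-epoch losses:
assert_starts_to_converge([2.3, 1.9, 1.7, 1.6])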
